Abstract

Motivated by applications of large-scale graph clustering, we study random-walk-based local algorithms whose running times depend only on the size of the output cluster, rather than the entire graph. In particular, we develop a method with better theoretical guarantees than all previous work, both in terms of the clustering accuracy and the conductance of the output set. We also prove that our analysis is tight, and perform empirical evaluation to support our theory on both synthetic and real data.

More specifically, our method outperforms prior work when the cluster is well-connected. In fact, the better connected the cluster is inside, the more significant the improvement we obtain. Our results shed light on why in practice some random-walk-based algorithms perform better than their previous theory predicts, and help guide future research about local clustering.

1 Introduction 

As a central problem in machine learning, clustering methods have been applied to data mining, computer vision, and social network analysis. Although a huge number of results are known in this area, there is still a need to explore methods that are robust and efficient on large data sets, and have good theoretical guarantees. In particular, several algorithms restrict the number of clusters, or impose constraints that make these algorithms impractical for large data sets.

To solve those issues, local random-walk clustering algorithms [ST04, ACL06] have recently been introduced. The main idea behind those algorithms is to find a good cluster around a specific node. These techniques, thanks to their scalability, have had high impact in practical applications [LLDM09, GLMY11, GS12, AGM12, LLM10, WLS+12]. Nevertheless, the theoretical understanding of these techniques is still very limited. In this paper, we make an important contribution in this direction. First, we relate for the first time the performance of these local algorithms to the internal connectivity of a cluster instead of analyzing only its external connectivity. This change of perspective is relevant for practical applications where we are interested not only in finding clusters that are loosely connected with the rest of the world, but also clusters that are well-connected internally. In particular, we show theoretically and empirically that this internal connectivity is a fundamental parameter for those algorithms and, by leveraging it, it is possible to improve their performance.

Formally, we study the clustering problem where the data set is given by a similarity matrix as a graph: given an undirected^1 graph $G = (V,E)$, we want to find a set $S$ that minimizes the

*Part of this work was done when the authors were at Google Research New York City. An extended abstract of this paper has appeared in the proceedings of the 30th International Conference on Machine Learning (ICML) 2013.
^1 All our results can be generalized easily to consider weighted graphs.



relative number of edges going out of $S$ with respect to the size of $S$ (or the size of $\bar{S}$ if $S$ is larger than $\bar{S}$). To capture this concept rigorously, we consider the cut conductance of a set $S$ as:

$$\phi_c(S) := \frac{|E(S,\bar{S})|}{\min\{\mathrm{vol}(S), \mathrm{vol}(\bar{S})\}},$$

where $\mathrm{vol}(S) := \sum_{v \in S} \deg(v)$. Finding $S$ with the smallest $\phi_c(S)$ is called conductance minimization. This measure is well studied in different disciplines [SM00, ST04, ACL06, GLMY11, GS12], and has been identified as one of the most important cut-based measures in the literature [Sch07]. Many approximation algorithms have been developed for the problem, but most of them are global ones: their running time depends at least linearly on the size of the graph. A recent trend, initiated by Spielman and Teng [ST04], and then followed by [ST08, ACL06, AP09, OT12], attempts to solve this conductance minimization problem locally, with running time only dependent on the volume of the output set.
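To make the definition of cut conductance concrete, the following is a minimal sketch that evaluates $\phi_c(S)$ on a toy graph. The adjacency-dictionary representation and the helper names `cut_conductance` and `two_cliques_with_bridge` are assumptions made for illustration; they do not come from the paper.

```python
from collections import defaultdict


def cut_conductance(adj, S):
    """phi_c(S) = |E(S, V \\ S)| / min(vol(S), vol(V \\ S)) for a simple undirected graph.

    `adj` maps each vertex to the set of its neighbours (illustrative representation).
    """
    S = set(S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_total = sum(len(nbrs) for nbrs in adj.values())
    # Each cut edge is counted once, from its endpoint inside S.
    cut = sum(1 for u in S for w in adj[u] if w not in S)
    denom = min(vol_S, vol_total - vol_S)
    if denom == 0:
        raise ValueError("S and its complement must both have positive volume")
    return cut / denom


def two_cliques_with_bridge(k):
    """Two k-cliques on vertices 0..k-1 and k..2k-1, joined by the single edge (k-1, k)."""
    adj = defaultdict(set)
    for block in (range(0, k), range(k, 2 * k)):
        for u in block:
            for w in block:
                if u != w:
                    adj[u].add(w)
    adj[k - 1].add(k)
    adj[k].add(k - 1)
    return dict(adj)


adj = two_cliques_with_bridge(5)
# One clique has volume 5*4 + 1 = 21 and exactly one outgoing edge.
phi = cut_conductance(adj, range(5))
```

On this toy instance the clique is a set with small cut conductance that is also well connected inside, which is the kind of cluster the paper targets.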

In particular, if there exists a set $A \subset V$ with $\phi_c(A) \le \Psi$, these local algorithms guarantee the existence of some set $A^g \subseteq A$ with at least half the volume, such that for any "good" starting vertex $v \in A^g$, they output a set $S$ with conductance $\phi_c(S) = \tilde{O}(\sqrt{\Psi})$.

Finding Well-Connected Clusters. All local clustering algorithms developed so far, both theoretical ones and empirical ones, only assume that $\phi_c(A)$ is small, i.e., $A$ is poorly connected to $\bar{A}$. Notice that such a set $A$, no matter how small $\phi_c(A)$ is, may be poorly connected or even disconnected inside. This cannot happen in reality if $A$ is a "good" cluster, and in practice we are often interested in finding mostly good clusters. This motivates us to study an extra measure on $A$, namely the connectedness of $A$, denoted by $\mathsf{Conn}(A)$ and defined formally in Section 2. We assume that, in addition to the assumptions of prior work, the cluster $A$ satisfies the gap assumption

$$\mathsf{Gap} = \mathsf{Gap}(A) := \frac{\mathsf{Conn}(A)}{\Psi} \ge \Omega(1),$$

which says that $A$ is better connected inside than it is connected to $\bar{A}$. This assumption is particularly relevant when the edges of the graph represent pairwise similarity scores extracted from a machine learning algorithm: we would expect similar nodes to be well connected among themselves while dissimilar nodes are loosely connected. As a result, it is not surprising that the notion of connectedness is not new. For instance, [KVV04] studied a bicriteria optimization for this objective. However, local algorithms based on the above gap assumption are not well studied.^2

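Our Results. Under the gap assumption $\mathsf{Gap} \ge \Omega(1)$, can we guarantee any better cut conductance than the previously shown $\tilde{O}(\sqrt{\Psi})$ ones? We prove that the answer is affirmative, along with some other desirable properties. In particular, we prove:

Theorem 1. If there exists a non-empty set $A \subset V$ such that $\phi_c(A) \le \Psi$ and $\mathsf{Gap} \ge \Omega(1)$, then there exists some $A^g \subseteq A$ with $\mathrm{vol}(A^g) \ge \frac{1}{2}\mathrm{vol}(A)$ such that, when choosing a starting vertex $v \in A^g$, the PageRank-Nibble algorithm outputs a set $S$ with

1. $\mathrm{vol}(S \setminus A) \le O\big(\frac{1}{\mathsf{Gap}}\big)\,\mathrm{vol}(A)$,

2. $\mathrm{vol}(A \setminus S) \le O\big(\frac{1}{\mathsf{Gap}}\big)\,\mathrm{vol}(A)$,

3. $\phi_c(S) \le O\big(\sqrt{\Psi/\mathsf{Gap}}\big)$, and

with running time $O\big(\frac{\mathrm{vol}(A)}{\Psi \cdot \mathsf{Gap}}\big) \le O\big(\frac{\mathrm{vol}(A)}{\Psi}\big)$.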



^2 One relevant paper using this assumption is [MMV12], which provides a global SDP-based algorithm to approximate cut conductance under this assumption.



We interpret the above theorem as follows. The first two properties imply that under $\mathsf{Gap} \ge \Omega(1)$, the volumes $\mathrm{vol}(S \setminus A)$ and $\mathrm{vol}(A \setminus S)$ are both small in comparison to $\mathrm{vol}(A)$, and the larger the gap is, the more accurately $S$ approximates $A$.^3 For the third property on the cut conductance $\phi_c(S)$, we notice that our guarantee $O(\sqrt{\Psi/\mathsf{Gap}}) \le O(\sqrt{\Psi})$ outperforms all previous work on local clustering under this gap assumption. In addition, $\mathsf{Gap}$ might be very large in reality. For instance, when $A$ is a very well-connected cluster it might satisfy $\mathsf{Conn}(A) = 1/\mathrm{polylog}(n)$, and as a consequence $\mathsf{Gap}$ may be as large as $\tilde{\Omega}(1/\Psi)$. In this case our Theorem 1 guarantees a $\mathrm{polylog}(n)$ approximation to the cut conductance.



Our proof of Theorem 1 uses almost the same PageRank algorithm as [ACL06], but with a very different analysis specifically designed for our gap assumption. This algorithm is simple and clean, and can be described in four steps: 1) compute the (approximate) PageRank vector starting from a vertex $v \in A^g$ with carefully chosen parameters, 2) sort all the vertices according to their (normalized) probabilities in this vector, 3) study all sweep cuts, that is, those separating high-value vertices from low-value ones, and 4) output the sweep cut with the best cut conductance. See Algorithm 1 for details of this algorithm.

We also prove that our analysis is tight.

Theorem 2. There exists a graph $G = (V,E)$ and a non-empty $A \subset V$ with $\Psi$ and $\mathsf{Gap} = \Omega(1)$, such that for all starting vertices $v \in A$, none of the sweep-cut based algorithms on the PageRank vector can output a set $S$ with cut conductance better than $O(\sqrt{\Psi/\mathsf{Gap}})$.

We prove this tightness result by exhibiting a hard instance, and proving upper and lower bounds on the probabilities of reaching specific vertices (up to a very high precision). Theorem 2 does not rule out the existence of another local algorithm that can perform better than $O(\sqrt{\Psi/\mathsf{Gap}})$. However, we conjecture that all existing (random-walk-based) local clustering algorithms share the same hard instance and do not outperform $O(\sqrt{\Psi/\mathsf{Gap}})$, similar to the classical case where they all provide only an $\tilde{O}(\sqrt{\Psi})$ guarantee due to Cheeger's inequality. It is an interesting open question to design a flow-based local algorithm to overcome this barrier under our gap assumption.

Prior Work. Related work is discussed in depth in Appendix G.

Roadmap. We provide preliminaries in Section 2, followed by the high-level ideas of the proofs for Theorem 1 in Section 3 and Section 4. We then briefly describe how to prove our tightness result in Section 5, and end this extended abstract with empirical studies in Section 6.



2 Preliminaries 



2.1 Problem Formulation 

Consider an undirected graph $G(V,E)$ with $n = |V|$ vertices and $m = |E|$ edges. For any vertex $u \in V$ the degree of $u$ is denoted by $\deg(u)$, and for any subset of the vertices $S \subseteq V$, the volume of $S$ is denoted by $\mathrm{vol}(S) := \sum_{u \in S} \deg(u)$. Given two subsets $A, B \subset V$, let $E(A,B)$ be the set of edges between $A$ and $B$.

For a vertex set $S \subseteq V$, we denote by $G[S]$ the induced subgraph of $G$ on $S$ with outgoing edges removed, by $\deg_S(u)$ the degree of vertex $u \in S$ in $G[S]$, and by $\mathrm{vol}_S(T)$ the volume of $T \subseteq S$ in $G[S]$.



^3 Very recently, [WLS+12] studied a variant of the PageRank random walk, and their first experiment, although analyzed from a different perspective, essentially confirmed our first two properties in Theorem 1. However, they have not attempted to explain this in theory.



We respectively define the cut conductance and the set conductance of a non-empty set $S \subseteq V$ as follows:

$$\phi_c(S) := \frac{|E(S,\bar{S})|}{\min\{\mathrm{vol}(S), \mathrm{vol}(\bar{S})\}}, \qquad \phi_s(S) := \min_{\emptyset \subset T \subset S} \frac{|E(T, S \setminus T)|}{\min\{\mathrm{vol}_S(T), \mathrm{vol}_S(S \setminus T)\}}.$$

Here $\phi_c(S)$ is classically known as the conductance of $S$, and $\phi_s(S)$ is classically known as the conductance of $S$ on the induced subgraph $G[S]$.

We formalize our goal in this paper as a promise problem. Specifically, we assume the existence of a non-empty cluster of the vertices $A \subset V$ satisfying $\mathrm{vol}(A) \le \frac{1}{2}\mathrm{vol}(V)$^4 as well as $\phi_s(A) \ge \Phi$ and $\phi_c(A) \le \Psi$. This set $A$ is not known to the algorithm. The goal is to find some set $S$ that "reasonably" approximates $A$, and at the same time to be local: running in time proportional to $\mathrm{vol}(A)$ rather than $n$ or $m$.

Our assumption. We assume that the following gap assumption:

$$\mathsf{Gap} := \frac{\mathsf{Conn}(A)}{\Psi} := \frac{\Phi^2 / \log \mathrm{vol}(A)}{\Psi} \ge \Omega(1) \qquad \text{(Gap Assumption)}$$

holds throughout this paper. This assumption can be understood as saying that the cluster $A$ is better connected inside than it is connected to $\bar{A}$.

(This assumption can be weakened by replacing the definition of $\mathsf{Conn}(A)$ with $\mathsf{Conn}(A) := \frac{1}{\tau_{\mathrm{mix}}(A)}$, where $\tau_{\mathrm{mix}}(A)$ is the mixing time for the relative pointwise distance in $G[A]$; or less weakly $\mathsf{Conn}(A) := \frac{\lambda(A)}{\log \mathrm{vol}(A)}$, where $\lambda(A)$ is the spectral gap, i.e., 1 minus the second largest eigenvalue of the random walk matrix on $G[A]$. We will discuss them in Appendix B.)

Input parameters. Similar to prior work on local clustering, we assume the algorithm takes as 
input: 

• Some "good" starting vertex $v \in A$, and an oracle to output the set of neighbors for any given vertex.

This requirement is essential because without such an oracle the algorithm may have to read all inputs and cannot be sublinear in time; and without a starting vertex the sublinear-time algorithm may be unable to even find an element in $A$.

We also need $v$ to be "good", as for instance the vertices on the boundary of $A$ may not be helpful enough in finding good clusters. We call the set of good vertices $A^g \subseteq A$, and a local algorithm needs to ensure that $A^g$ is large, i.e., $\mathrm{vol}(A^g) \ge \frac{1}{2}\mathrm{vol}(A)$.

• The value of $\Phi$.

In practice $\Phi$ can be viewed as a parameter and can be tuned for specific data. This is in contrast to the value of $\Psi$, which is the target cut conductance and does not need to be known by the algorithm.^5

• A value $\mathrm{vol}_0$ satisfying $\mathrm{vol}(A) \in [\mathrm{vol}_0, 2\mathrm{vol}_0]$.^6



^4 This assumption is unavoidable in all local clustering work. One can replace this $\frac{1}{2}$ by any other constant at the expense of worsening the guarantees by a constant factor.

^5 In prior work, when $\Psi$ is the only quantity studied, $\Psi$ plays both roles as a tuning parameter and as a target.

^6 This requirement is optional since otherwise the algorithm can try out different powers of 2 and pick the smallest one with a valid output. It blows up the running time only by a constant factor for local algorithms, since the running time of the last trial dominates.



2.2 PageRank Random Walk 

We use the convention of writing vectors as row vectors in this paper. Let $A$ be the adjacency matrix of $G$, and let $D$ be the diagonal matrix with $D_{ii} = \deg(i)$; then the lazy random walk matrix is $W := \frac{1}{2}(I + D^{-1}A)$. Accordingly, the PageRank vector $\mathrm{pr}_{s,\alpha}$ is defined to be the unique solution of the following linear equation (cf. [ACL06]):

$$\mathrm{pr}_{s,\alpha} = \alpha s + (1 - \alpha)\,\mathrm{pr}_{s,\alpha} W,$$

where $\alpha \in (0,1]$ is the teleport probability and $s$ is a starting vector. Here $s$ is usually a probability vector: its entries are in $[0,1]$ and sum up to 1. For technical reasons we may use an arbitrary (and possibly negative) vector $s$ inside the proof. When it is clear from the context, we drop $\alpha$ in the subscript for cleanness.

Given a vertex $u \in V$, let $\chi_u \in \{0,1\}^V$ be the indicator vector that is 1 only at vertex $u$. Given a non-empty subset $S \subseteq V$ we denote by $\pi_S$ the degree-normalized uniform distribution on $S$, that is, $\pi_S(u) = \frac{\deg(u)}{\mathrm{vol}(S)}$ when $u \in S$ and 0 otherwise. Very often we study a PageRank vector when $s = \chi_v$ is an indicator vector, and if so we abbreviate $\mathrm{pr}_{\chi_v}$ by $\mathrm{pr}_v$.

One equivalent way to study $\mathrm{pr}_s$ is to imagine the following random procedure: first pick a non-negative integer $t \in \mathbb{Z}_{\ge 0}$ with probability $\alpha(1-\alpha)^t$, then perform a lazy random walk starting at vector $s$ with exactly $t$ steps, and at last define $\mathrm{pr}_s$ to be the vector describing the probability of reaching each vertex in this random procedure. As a mathematical formula we have (cf. [Hav02, ACL06]):

Proposition 2.1. $\mathrm{pr}_s = \alpha s + \alpha \sum_{t=1}^{\infty} (1-\alpha)^t (sW^t)$.

This implies that $\mathrm{pr}_s$ is linear: $a \cdot \mathrm{pr}_s + b \cdot \mathrm{pr}_t = \mathrm{pr}_{as+bt}$.
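The series in Proposition 2.1 can be checked numerically on a small graph. The sketch below is an illustration only: the dictionary-based vectors, the function names `lazy_walk_step` and `pagerank_series`, and the truncation length `terms` are assumptions of this example, not part of the paper. It truncates the series, then verifies the defining fixed-point equation and the linearity property.

```python
def lazy_walk_step(adj, x):
    """One step x -> xW of the lazy walk W = (I + D^{-1}A)/2, with x a row vector (dict)."""
    y = {u: 0.5 * x.get(u, 0.0) for u in adj}
    for u, mass in x.items():
        share = 0.5 * mass / len(adj[u])  # half of the mass is spread evenly over neighbours
        for w in adj[u]:
            y[w] += share
    return y


def pagerank_series(adj, s, alpha, terms=2000):
    """Truncated series pr_s = alpha * sum_{t >= 0} (1 - alpha)^t * s W^t."""
    pr = {u: 0.0 for u in adj}
    walk = {u: s.get(u, 0.0) for u in adj}  # holds s W^t
    weight = alpha                          # holds alpha * (1 - alpha)^t
    for _ in range(terms):
        for u in adj:
            pr[u] += weight * walk[u]
        walk = lazy_walk_step(adj, walk)
        weight *= 1.0 - alpha
    return pr


# Path graph 0 - 1 - 2 - 3 (illustrative).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
alpha = 0.2
pr0 = pagerank_series(adj, {0: 1.0}, alpha)
pr2 = pagerank_series(adj, {2: 1.0}, alpha)
mix = pagerank_series(adj, {0: 0.3, 2: 0.7}, alpha)

# Residual of the defining equation pr = alpha * s + (1 - alpha) * pr W for s = chi_0.
step = lazy_walk_step(adj, pr0)
residual = max(
    abs(pr0[u] - (alpha * (1.0 if u == 0 else 0.0) + (1 - alpha) * step[u]))
    for u in adj
)
# Deviation from linearity: pr_{0.3 chi_0 + 0.7 chi_2} vs 0.3 pr_0 + 0.7 pr_2.
linearity_gap = max(abs(mix[u] - (0.3 * pr0[u] + 0.7 * pr2[u])) for u in adj)
```

The truncation is harmless here because the neglected tail has total weight $(1-\alpha)^{\texttt{terms}}$.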

2.3 Approximate PageRank Vector 

In the seminal work of [ACL06], the authors defined approximate PageRank vectors and designed an algorithm to compute them efficiently.

Definition 2.2. An $\varepsilon$-approximate PageRank vector $p$ for $\mathrm{pr}_s$ is a nonnegative PageRank vector $p = \mathrm{pr}_{s-r}$ where the vector $r$ is nonnegative and satisfies $r(u) \le \varepsilon \deg(u)$ for all $u \in V$.

Proposition 2.3. For any starting vector $s$ with $\|s\|_1 \le 1$ and $\varepsilon \in (0,1]$, one can compute an $\varepsilon$-approximate PageRank vector $p = \mathrm{pr}_{s-r}$ for some $r$ in time $O\big(\frac{1}{\varepsilon\alpha}\big)$, with $\mathrm{vol}(\mathrm{supp}(p)) \le \frac{2}{(1-\alpha)\varepsilon}$.

For completeness we provide the algorithm and its proof in Appendix C. It can be verified that:

$$\forall u \in V, \quad \mathrm{pr}_s(u) \ge p(u) \ge \mathrm{pr}_s(u) - \varepsilon \deg(u). \qquad (2.1)$$
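The paper defers its algorithm for Proposition 2.3 to Appendix C. As a hedged illustration, the sketch below implements the standard push procedure from [ACL06] for the lazy walk, which is one known way to obtain such a vector; the function names, the queue-based scheduling, and the toy graph are assumptions of this example and may differ from the paper's appendix. It maintains the invariant $p = \mathrm{pr}_{\chi_v - r}$ and stops once $r(u) < \varepsilon \deg(u)$ everywhere.

```python
from collections import deque


def approximate_pagerank(adj, v, alpha, eps):
    """ACL06-style push: returns (p, r) with p = pr_{chi_v - r} and r(u) < eps * deg(u)."""
    p, r = {}, {v: 1.0}
    queue = deque([v])  # vertices whose residual may be at least eps * deg
    while queue:
        u = queue.popleft()
        deg_u = len(adj[u])
        ru = r.get(u, 0.0)
        if ru < eps * deg_u:
            continue
        # Push at u: settle an alpha fraction, keep half of the rest, spread the other half.
        p[u] = p.get(u, 0.0) + alpha * ru
        r[u] = 0.5 * (1.0 - alpha) * ru
        if r[u] >= eps * deg_u:
            queue.append(u)
        share = 0.5 * (1.0 - alpha) * ru / deg_u
        for w in adj[u]:
            old = r.get(w, 0.0)
            r[w] = old + share
            if old < eps * len(adj[w]) <= r[w]:  # w just crossed the threshold
                queue.append(w)
    return p, r


def fixed_point_gap(adj, v, alpha, p, r):
    """Max violation of p = alpha * (chi_v - r) + (1 - alpha) * p W over all vertices."""
    gap = 0.0
    for u in adj:
        pW_u = 0.5 * p.get(u, 0.0) + sum(0.5 * p.get(w, 0.0) / len(adj[w]) for w in adj[u])
        rhs = alpha * ((1.0 if u == v else 0.0) - r.get(u, 0.0)) + (1.0 - alpha) * pW_u
        gap = max(gap, abs(p.get(u, 0.0) - rhs))
    return gap


# Illustrative 6-cycle.
adj = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
alpha, eps = 0.1, 1e-4
p, r = approximate_pagerank(adj, 0, alpha, eps)
gap = fixed_point_gap(adj, 0, alpha, p, r)
```

Each push removes at least $\alpha\varepsilon\deg(u)$ from the total residual mass, which is the source of the $O(1/(\varepsilon\alpha))$ running time in [ACL06].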



2.4 Sweep Cuts 



Given any approximate PageRank vector $p$, the sweep cut (or threshold cut) technique sorts all vertices according to their degree-normalized probabilities $\frac{p(u)}{\deg(u)}$, and then studies only those cuts that separate high-value vertices from low-value vertices. More specifically, let $v_1, v_2, \ldots, v_n$ be the decreasing order over all vertices with respect to $\frac{p(u)}{\deg(u)}$. Then, define sweep sets $S_j^p := \{v_1, \ldots, v_j\}$ for each $j \in [n]$; the sweep cuts are the corresponding cuts $(S_j^p, \overline{S_j^p})$. Usually given a vector $p$, one looks for the best cut:

$$\min_{j \in [n]} \phi_c(S_j^p).$$

In almost all cases, one only needs to enumerate $j$ over $p(v_j) > 0$, so the above sweep cut procedure runs in time $O(\mathrm{vol}(\mathrm{supp}(p)) + |\mathrm{supp}(p)| \cdot \log|\mathrm{supp}(p)|)$. This running time is dominated by the time to compute $p$ (see Proposition 2.3), so it is negligible.
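The sweep procedure just described can be sketched as follows. This is a minimal illustration under assumed conventions (adjacency lists in a dictionary, a hypothetical function name `best_sweep_cut`); it updates the volume and the cut size incrementally, so the work after sorting is proportional to the volume of the support of $p$.

```python
def best_sweep_cut(adj, p):
    """Return (S, phi) for the sweep set of p with the smallest cut conductance."""
    total_vol = sum(len(nbrs) for nbrs in adj.values())
    # Decreasing order of degree-normalised probability, restricted to supp(p).
    order = sorted(
        (u for u in p if p[u] > 0),
        key=lambda u: p[u] / len(adj[u]),
        reverse=True,
    )
    in_set, vol, cut = set(), 0, 0
    best_phi, best_size = float("inf"), 0
    for j, u in enumerate(order, start=1):
        inside = sum(1 for w in adj[u] if w in in_set)
        in_set.add(u)
        vol += len(adj[u])
        # Edges from u to the current set stop being cut edges; the others become cut edges.
        cut += len(adj[u]) - 2 * inside
        denom = min(vol, total_vol - vol)
        if denom > 0 and cut / denom < best_phi:
            best_phi, best_size = cut / denom, j
    return set(order[:best_size]), best_phi


# Two triangles {0,1,2} and {3,4,5} joined by the edge (2, 3); p is an illustrative vector.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
p = {0: 0.3, 1: 0.3, 2: 0.3, 3: 0.05, 4: 0.03, 5: 0.02}
S, phi = best_sweep_cut(adj, p)
```

On this toy input the best sweep set is the first triangle, whose single outgoing edge against volume 7 gives cut conductance 1/7.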



2.5 Lovász-Simonovits Curve

Our proof requires the technique of the Lovász-Simonovits curve, which has been used in one form or another in all local clustering algorithms so far. This technique was originally introduced by Lovász and Simonovits [LS90, LS93] to study the mixing rate of Markov chains. In our language, from a probability vector $p$ on vertices, one can introduce a function $p[x]$ on real numbers $x \in [0, 2m]$. This function $p[x]$ is piecewise linear, and is characterized by all of its end points as follows (letting $p(S) := \sum_{a \in S} p(a)$):

$$p[0] := 0, \qquad p[\mathrm{vol}(S_j^p)] := p(S_j^p) \ \text{ for each } j \in [n].$$

In other words, for any $x \in [\mathrm{vol}(S_j^p), \mathrm{vol}(S_{j+1}^p)]$,

$$p[x] := p(S_j^p) + \frac{x - \mathrm{vol}(S_j^p)}{\deg(v_{j+1})}\, p(v_{j+1}).$$

Note that $p[x]$ is increasing and concave.
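As a small illustration of this definition, the sketch below builds the breakpoints of the curve from a vector $p$ and evaluates it by linear interpolation. The function names and the toy graph are assumptions of this example; concavity follows because consecutive slopes are the sorted values $p(v_j)/\deg(v_j)$.

```python
def lovasz_simonovits_curve(adj, p):
    """Breakpoints (xs, ys) of p[x]: xs[j] = vol(S_j), ys[j] = p(S_j), with xs[0] = ys[0] = 0."""
    order = sorted(adj, key=lambda u: p.get(u, 0.0) / len(adj[u]), reverse=True)
    xs, ys = [0.0], [0.0]
    for u in order:
        xs.append(xs[-1] + len(adj[u]))
        ys.append(ys[-1] + p.get(u, 0.0))
    return xs, ys


def evaluate_curve(xs, ys, x):
    """Linear interpolation of the curve at a real number x in [0, 2m]."""
    if x < 0 or x > xs[-1]:
        raise ValueError("x must lie in [0, 2m]")
    for j in range(1, len(xs)):
        if x <= xs[j]:
            t = (x - xs[j - 1]) / (xs[j] - xs[j - 1])
            return ys[j - 1] + t * (ys[j] - ys[j - 1])
    return ys[-1]


# Two triangles joined by an edge, with an illustrative probability vector.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
p = {0: 0.3, 1: 0.3, 2: 0.3, 3: 0.05, 4: 0.03, 5: 0.02}
xs, ys = lovasz_simonovits_curve(adj, p)
slopes = [(ys[j] - ys[j - 1]) / (xs[j] - xs[j - 1]) for j in range(1, len(xs))]
mid_value = evaluate_curve(xs, ys, 5.5)  # halfway between vol(S_2) = 4 and vol(S_3) = 7
```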

3 Guarantee Better Accuracy 

In this section, we study PageRank random walks that start at a vertex $v \in A$ with teleport probability $\alpha$. We claim the range of interesting $\alpha$ is $\big[\Omega(\Psi), O\big(\frac{\Phi^2}{\log n}\big)\big]$. This is because, at a high level, when $\alpha \ll \Psi$ the random walk will leak too much to $\bar{A}$; while when $\alpha \gg \frac{\Phi^2}{\log n}$, the random walk will not mix well inside $A$. In prior work, $\alpha$ is chosen to be $\Theta(\Psi)$, and we will instead choose $\alpha = \Theta\big(\frac{\Phi^2}{\log \mathrm{vol}(A)}\big) = \Theta(\Psi \cdot \mathsf{Gap})$. Intuitively, this choice of $\alpha$ ensures that, under the condition that the random walk mixes inside, the walk leaks as little as possible to $\bar{A}$. We prove the above intuition rigorously in this section. Specifically, we first show some properties of the exact PageRank vector in Section 3.1, and then move to the approximate vector in Section 3.2. This essentially proves the first two properties of Theorem 1.



3.1 Properties on the Exact Vector 

We first introduce a new notation $\widetilde{\mathrm{pr}}_s$, that is, the PageRank vector (with teleport probability $\alpha$) starting at vector $s$ but walking on the subgraph $G[A]$.

Next, we choose the set of "good" starting vertices $A^g$ to satisfy two properties: (1) the total probability of leakage is upper bounded by $\frac{2\Psi}{\alpha}$, and (2) $\mathrm{pr}_v$ is close to $\widetilde{\mathrm{pr}}_v$ for vertices in $A$. Note that the latter implies that $\mathrm{pr}_v$ mixes well inside $A$ as long as $\widetilde{\mathrm{pr}}_v$ does so.

Lemma 3.1. There exists a set $A^g \subseteq A$ with volume $\mathrm{vol}(A^g) \ge \frac{1}{2}\mathrm{vol}(A)$ such that, for any vertex $v \in A^g$, in a PageRank vector with teleport probability $\alpha$ starting at $v$, we have:

$$\sum_{u \notin A} \mathrm{pr}_v(u) \le \frac{2\Psi}{\alpha}. \qquad (3.1)$$

In addition, there exists a non-negative leakage vector $l \in [0,1]^V$ with norm $\|l\|_1 \le \frac{2\Psi}{\alpha}$ satisfying

$$\forall u \in A, \quad \mathrm{pr}_v(u) \ge \widetilde{\mathrm{pr}}_v(u) - \widetilde{\mathrm{pr}}_l(u). \qquad (3.2)$$

(Details of the proof are in Appendix D.1.)

Proof sketch. The proof of the first property (3.1) is classical and can be found in [ACL06]. The idea is to study an auxiliary PageRank random walk with teleport probability $\alpha$ starting at the degree-normalized uniform distribution $\pi_A$; by a simple computation, this random walk leaks to $\bar{A}$ with probability no more than $\Psi/\alpha$. Then, using the Markov bound, there exists $A^g \subseteq A$ with $\mathrm{vol}(A^g) \ge \frac{1}{2}\mathrm{vol}(A)$ such that for each starting vertex $v \in A^g$, this leakage is no more than $\frac{2\Psi}{\alpha}$. This implies (3.1) immediately.

The interesting part is (3.2). Note that $\mathrm{pr}_v$ can be viewed as the probability vector from the following random procedure: start from vertex $v$, then at each step with probability $\alpha$ let the walk stop, and with probability $(1 - \alpha)$ follow the matrix $W$ to go to one of its neighbors (or itself) and continue. Now, we divide this procedure into two rounds. In the first round, we run the same PageRank random walk, but whenever the walk wants to use an outgoing edge from $A$ to leak, we let it stop and temporarily "hold" this probability mass. We define $l$ to be the non-negative vector where $l(u)$ denotes the amount of probability that we have "held" at vertex $u$. In the second round, we continue our random walk only from vector $l$. It is worth noting that $l$ is non-zero only at boundary vertices in $A$.

Similarly, we divide the PageRank random walk for $\widetilde{\mathrm{pr}}_v$ into two rounds. In the first round we hold exactly the same amount of probability $l(u)$ at boundary vertices $u$, and in the second round we start from $l$ but continue this random walk only within $G[A]$. To bound the difference between $\mathrm{pr}_v$ and $\widetilde{\mathrm{pr}}_v$, we note that they share the same procedure in the first round; while for the second round, the random procedure for $\mathrm{pr}_v$ starts at $l$ and walks towards $V \setminus A$ (so in the worst case it may never come back to $A$ again), while that for $\widetilde{\mathrm{pr}}_v$ starts at $l$ and walks only inside $G[A]$, so it induces a probability vector $\widetilde{\mathrm{pr}}_l$ on $A$. This gives (3.2).

At last, to see $\|l\|_1 \le \frac{2\Psi}{\alpha}$, one just needs to verify that $l(u)$ is essentially the probability that the original PageRank random walk leaks from vertex $u$. Then, $\|l\|_1 \le \frac{2\Psi}{\alpha}$ follows from the fact that the total amount of leakage is upper bounded by $\frac{2\Psi}{\alpha}$. □



As mentioned earlier, we want to use (3.2) to lower bound $\mathrm{pr}_v(u)$ for vertices $u \in A$. We achieve this by first lower bounding $\widetilde{\mathrm{pr}}_v$, which is the PageRank random walk on $G[A]$. Given a teleport probability $\alpha$ that is small compared to $\frac{\Phi^2}{\log \mathrm{vol}(A)}$, this random walk should mix well. We formally state it as the following lemma, and provide its proof in Appendix D.2.

Lemma 3.2. When $\alpha \le O(\Psi \cdot \mathsf{Gap})$ we have that

$$\forall u \in A, \quad \widetilde{\mathrm{pr}}_v(u) \ge \frac{4}{5} \cdot \frac{\deg_A(u)}{\mathrm{vol}(A)}.$$

Here $\deg_A(u)$ is the degree of $u$ on $G[A]$, but $\mathrm{vol}(A)$ is with respect to the original graph.



3.2 Properties of the Approximate Vector 

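From this section on we always use $\alpha \le O(\Psi \cdot \mathsf{Gap})$. We then fix a starting vertex $v \in A^g$ and study an $\varepsilon$-approximate PageRank vector for $\mathrm{pr}_v$. We choose

$$\varepsilon = \frac{1}{10 \cdot \mathrm{vol}_0} \in \left[\frac{1}{20\,\mathrm{vol}(A)}, \frac{1}{10\,\mathrm{vol}(A)}\right]. \qquad (3.3)$$

For notational simplicity, we denote by $p$ this $\varepsilon$-approximation and recall from Section 2.3 that $p = \mathrm{pr}_{\chi_v - r}$ where $r$ is a non-negative vector with $0 \le r(u) \le \varepsilon \deg(u)$ for every $u \in V$. Recall from (2.1) that $\mathrm{pr}_v(u) \ge p(u) \ge \mathrm{pr}_v(u) - \varepsilon \cdot \deg(u)$ for all $u \in V$.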



We now rewrite Lemma 3.1 in the language of approximate PageRank vectors using Lemma 3.2.

Corollary 3.3. For any $v \in A^g$ and $\alpha \le O(\Psi \cdot \mathsf{Gap})$, in an $\varepsilon$-approximate PageRank vector to $\mathrm{pr}_v$ denoted by $p = \mathrm{pr}_{\chi_v - r}$, we have:

$$\sum_{u \notin A} p(u) \le \frac{2\Psi}{\alpha} \quad \text{and} \quad \sum_{u \notin A} r(u) \le \frac{2\Psi}{\alpha}.$$

In addition, there exists a non-negative leakage vector $l \in [0,1]^V$ with norm $\|l\|_1 \le \frac{2\Psi}{\alpha}$ satisfying

$$\forall u \in A, \quad p(u) \ge \frac{4}{5} \cdot \frac{\deg_A(u)}{\mathrm{vol}(A)} - \frac{\deg(u)}{10\,\mathrm{vol}(A)} - \widetilde{\mathrm{pr}}_l(u).$$

Proof. The only inequality that requires a proof is $\sum_{u \notin A} r(u) \le \frac{2\Psi}{\alpha}$. In fact, if one takes a closer look at the algorithm to compute an approximate PageRank vector (cf. Appendix C), the total probability mass that will be sent to $r$ on vertices outside $A$ is upper bounded by the probability of leakage. However, the latter is upper bounded by $\frac{2\Psi}{\alpha}$ when we choose $A^g$. □

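We are now ready to state the main lemma of this section. We show that for all reasonable sweep sets $S$ on this probability vector $p$, both $\mathrm{vol}(S \setminus A)$ and $\mathrm{vol}(A \setminus S)$ are at most $O\big(\frac{\Psi}{\alpha}\mathrm{vol}(A)\big)$.

Lemma 3.4. With the same definitions of $\alpha$ and $p$ as in Corollary 3.3, let the sweep set be $S_c := \big\{u \in V : p(u) \ge c\,\frac{\deg(u)}{\mathrm{vol}(A)}\big\}$ for any constant $c < \frac{3}{5}$. Then we have the following guarantees on the size of $S_c \setminus A$ and $A \setminus S_c$:

1. $\mathrm{vol}(S_c \setminus A) \le \frac{2\Psi}{\alpha c}\,\mathrm{vol}(A)$, and

2. $\mathrm{vol}(A \setminus S_c) \le \Big(\frac{2\Psi}{\alpha(\frac{3}{5} - c)} + 8\Psi\Big)\,\mathrm{vol}(A)$.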



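Proof. First we notice that $p(S_c \setminus A) \le p(V \setminus A) \le \frac{2\Psi}{\alpha}$ owing to Corollary 3.3, and each vertex $u \in S_c \setminus A$ must satisfy $p(u) \ge c\,\frac{\deg(u)}{\mathrm{vol}(A)}$. Those combined imply $\mathrm{vol}(S_c \setminus A) \le \frac{2\Psi}{\alpha c}\,\mathrm{vol}(A)$, proving the first property.

We show the second property in two steps. First, let $A_b$ be the set of vertices in $A$ such that

$$\frac{4}{5} \cdot \frac{\deg_A(u)}{\mathrm{vol}(A)} - \frac{\deg(u)}{10\,\mathrm{vol}(A)} < \frac{3}{5} \cdot \frac{\deg(u)}{\mathrm{vol}(A)}.$$

Any such vertex $u \in A_b$ must have $\deg_A(u) < \frac{7}{8}\deg(u)$. This implies that $u$ has to be on the boundary of $A$ and $\mathrm{vol}(A_b) \le 8\Psi\,\mathrm{vol}(A)$.

Next, for a vertex $u \in A \setminus A_b$ we have (using Corollary 3.3 again) $p(u) \ge \frac{3}{5} \cdot \frac{\deg(u)}{\mathrm{vol}(A)} - \widetilde{\mathrm{pr}}_l(u)$. If we further have $u \notin S_c$, so that $p(u) < c\,\frac{\deg(u)}{\mathrm{vol}(A)}$, it implies that $\widetilde{\mathrm{pr}}_l(u) \ge \big(\frac{3}{5} - c\big)\frac{\deg(u)}{\mathrm{vol}(A)}$. As a consequence, the total volume of such vertices (i.e., $\mathrm{vol}(A \setminus (A_b \cup S_c))$) cannot exceed $\frac{\|\widetilde{\mathrm{pr}}_l\|_1}{3/5 - c}\,\mathrm{vol}(A)$. At last, we notice that $\widetilde{\mathrm{pr}}_l$ is a non-negative probability vector coming from a random walk procedure, so $\|\widetilde{\mathrm{pr}}_l\|_1 = \|l\|_1 \le \frac{2\Psi}{\alpha}$. In sum, this provides that

$$\mathrm{vol}(A \setminus S_c) \le \mathrm{vol}(A \setminus (A_b \cup S_c)) + \mathrm{vol}(A_b) \le \Big(\frac{2\Psi}{\alpha(\frac{3}{5} - c)} + 8\Psi\Big)\,\mathrm{vol}(A). \qquad \Box$$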



Note that if one chooses $\alpha = \Theta(\Psi \cdot \mathsf{Gap})$ in the above lemma, both of those volumes are at most $O(\mathrm{vol}(A)/\mathsf{Gap})$, satisfying the first two properties of Theorem 1.
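For readers who want the substitution spelled out, here is a short worked version. The constant $\kappa > 0$ is a hypothetical name introduced only for this illustration (the paper writes $\alpha = \Theta(\Psi \cdot \mathsf{Gap})$ without naming the constant), and the last step assumes $\log \mathrm{vol}(A) \ge 1$.

```latex
% Assume \alpha = \kappa \,\Psi\cdot\mathsf{Gap} for a constant \kappa > 0 (illustrative),
% and fix a constant c \in (0, 3/5). Lemma 3.4 then gives
\begin{align*}
\mathrm{vol}(S_c \setminus A)
  &\le \frac{2\Psi}{\alpha c}\,\mathrm{vol}(A)
   = \frac{2}{\kappa c}\cdot\frac{\mathrm{vol}(A)}{\mathsf{Gap}}, \\
\mathrm{vol}(A \setminus S_c)
  &\le \Big(\frac{2\Psi}{\alpha(\frac35 - c)} + 8\Psi\Big)\mathrm{vol}(A)
   = \Big(\frac{2}{\kappa(\frac35 - c)}\cdot\frac{1}{\mathsf{Gap}} + 8\Psi\Big)\mathrm{vol}(A).
\end{align*}
% Since \Psi\cdot\mathsf{Gap} = \mathsf{Conn}(A) = \Phi^2/\log\mathrm{vol}(A) \le 1
% whenever \log\mathrm{vol}(A) \ge 1, we have \Psi \le 1/\mathsf{Gap}, so the term
% 8\Psi is also O(1/\mathsf{Gap}) and both volumes are O(\mathrm{vol}(A)/\mathsf{Gap}).
```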



4 Guarantee Better Cut Conductance 



In the classical work of [ACL06], it is shown that when $\alpha = \Theta(\Psi)$, among all sweep cuts on vector $p$ there exists one with cut conductance $O(\sqrt{\Psi \log n})$. In this section, we improve this result under our gap assumption $\mathsf{Gap} \ge \Omega(1)$.

Lemma 4.1. Letting $\alpha = \Theta(\Psi \cdot \mathsf{Gap})$, among all sweep sets $S_c = \big\{u \in V : p(u) \ge c\,\frac{\deg(u)}{\mathrm{vol}(A)}\big\}$ for $c \in \big[\frac{1}{8}, \frac{1}{4}\big]$, there exists one, denoted by $S_{c^*}$, with cut conductance $\phi_c(S_{c^*}) = O(\sqrt{\Psi/\mathsf{Gap}})$.



Proof sketch. To convey the idea of the proof, we only consider the case when $p = \mathrm{pr}_v$ is the exact PageRank vector; the proof for the approximate case is a bit more involved and deferred to Appendix E.1.

Suppose that all sweep sets $S_c$ for $c \in \big[\frac{1}{8}, \frac{1}{4}\big]$ satisfy $|E(S_c, V \setminus S_c)| \ge E_0$ for some value $E_0$; then it suffices to prove $E_0 \le O\big(\frac{\Psi}{\sqrt{\alpha}}\big)\mathrm{vol}(A)$. This is because, if so, then there exists some $S_{c^*}$ with $|E(S_{c^*}, V \setminus S_{c^*})| \le E_0$, and this combined with the result in Lemma 3.4 (i.e., $\mathrm{vol}(S_{c^*}) = (1 \pm O(1/\mathsf{Gap}))\,\mathrm{vol}(A)$) gives

$$\phi_c(S_{c^*}) \le O\Big(\frac{E_0}{\mathrm{vol}(S_{c^*})}\Big) = O(\Psi/\sqrt{\alpha}) = O(\sqrt{\Psi/\mathsf{Gap}}).$$


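We introduce some classical notation before we proceed with the proof. For any vector $q$ we denote $q(S) := \sum_{u \in S} q(u)$. Also, given a directed edge^7 $e = (a,b) \in E$ we let $p(e) = p(a,b) := \frac{p(a)}{\deg(a)}$, and for a set of directed edges $E'$ we let $p(E') := \sum_{e \in E'} p(e)$. We also let $E(A,B) := \{(a,b) \in E \mid a \in A \wedge b \in B\}$ be the set of directed edges from $A$ to $B$.

Now for any set $S$ with $S_{1/4} \subseteq S \subseteq S_{1/8}$, we compute that

$$\begin{aligned}
& p(S) = \mathrm{pr}_v(S) = \alpha \chi_v(S) + (1-\alpha)(pW)(S) \le \alpha + (1-\alpha)(pW)(S) \\
\Longrightarrow\ & (1-\alpha)\,p(S) \le \alpha(1 - p(S)) + (1-\alpha)(pW)(S) \\
\Longrightarrow\ & (1-\alpha)\,p(S) \le 2\Psi + (1-\alpha)(pW)(S) \\
\Longrightarrow\ & p(S) < O(\Psi) + (pW)(S). \qquad (4.1)
\end{aligned}$$

Here we have used the fact that when $p = \mathrm{pr}_v$ is exact, it satisfies $1 - p(S) = p(V - S) \le 2\Psi/\alpha$ according to Corollary 3.3. In the next step, we use the definition of the lazy random walk matrix $W$ to compute that

$$\begin{aligned}
(pW)(S) &= \sum_{(a,b) \in E(S,S)} p(a,b) + \sum_{(a,b) \in E(S,\bar{S})} \frac{p(a,b) + p(b,a)}{2} \\
&= \tfrac{1}{2}\,p\big(E(S,S)\big) + \tfrac{1}{2}\,p\big(E(S,S) \cup E(S,\bar{S}) \cup E(\bar{S},S)\big) \\
&\le \tfrac{1}{2}\,p\big[\,|E(S,S)|\,\big] + \tfrac{1}{2}\,p\big[\,|E(S,S) \cup E(S,\bar{S}) \cup E(\bar{S},S)|\,\big] \\
&= \tfrac{1}{2}\,p\big[\mathrm{vol}(S) - |E(S,\bar{S})|\big] + \tfrac{1}{2}\,p\big[\mathrm{vol}(S) + |E(S,\bar{S})|\big] \\
&\le \tfrac{1}{2}\,p\big[\mathrm{vol}(S) - E_0\big] + \tfrac{1}{2}\,p\big[\mathrm{vol}(S) + E_0\big]. \qquad (4.2)
\end{aligned}$$

^7 $G$ is an undirected graph, but we study undirected edges with specific directions for analysis purposes only.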



| Digit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| $\Psi = \phi_c(A)$ | 0.00294 | 0.00304 | 0.08518 | 0.03316 | 0.22536 | 0.08580 | 0.01153 | 0.03258 | 0.09761 | 0.05139 |
| $\phi_c(S)$ | 0.00272 | 0.00067 | 0.03617 | 0.02220 | 0.00443 | 0.01351 | 0.00276 | 0.00456 | 0.03849 | 0.00448 |
| Precision | 0.993 | 0.995 | 0.839 | 0.993 | 0.988 | 0.933 | 0.946 | 0.985 | 0.941 | 0.994 |
| Recall | 0.988 | 0.988 | 0.995 | 0.773 | 0.732 | 0.896 | 0.997 | 0.805 | 0.819 | 0.705 |

Table 1: Clustering results on the USPS zipcode data set. We report precision $|A \cap S|/|S|$ and recall $|A \cap S|/|A|$.

Here the first inequality is due to the definition of the Lovász-Simonovits curve $p[x]$, and the second inequality is because $p[x]$ is concave. Next, suppose that in addition to $S_{1/4} \subseteq S \subseteq S_{1/8}$, we also know that $S$ is a sweep set, i.e., for all $a \in S$ and $b \notin S$ we have $\frac{p(a)}{\deg(a)} \ge \frac{p(b)}{\deg(b)}$. This implies $p(S) = p[\mathrm{vol}(S)]$, and combining (4.1) and (4.2) we obtain that

$$\big(p[\mathrm{vol}(S)] - p[\mathrm{vol}(S) - E_0]\big) \le O(\Psi) + \big(p[\mathrm{vol}(S) + E_0] - p[\mathrm{vol}(S)]\big).$$

Since we can choose $S$ to be an arbitrary sweep set between $S_{1/4}$ and $S_{1/8}$, the inequality $p[x] - p[x - E_0] \le O(\Psi) + p[x + E_0] - p[x]$ holds for all end points $x \in [\mathrm{vol}(S_{1/4}), \mathrm{vol}(S_{1/8})]$ on the piecewise linear curve $p[x]$. This implies that the same inequality holds for any real number $x \in [\mathrm{vol}(S_{1/4}), \mathrm{vol}(S_{1/8})]$ as well. We are now ready to draw our conclusion by repeatedly applying this inequality. Letting $x_1 := \mathrm{vol}(S_{1/4})$ and $x_2 := \mathrm{vol}(S_{1/8})$, we have

$$\begin{aligned}
\frac{E_0}{4\,\mathrm{vol}(A)} &\le p[x_1] - p[x_1 - E_0] \\
&\le O(\Psi) + \big(p[x_1 + E_0] - p[x_1]\big) \\
&\le 2 \cdot O(\Psi) + \big(p[x_1 + 2E_0] - p[x_1 + E_0]\big) \le \cdots \\
&\le \Big\lfloor \frac{x_2 - x_1}{E_0} + 1 \Big\rfloor\, O(\Psi) + \big(p[x_2 + E_0] - p[x_2]\big) \\
&\le \frac{\mathrm{vol}(S_{1/8} \setminus S_{1/4})}{E_0}\, O(\Psi) + \frac{E_0}{8\,\mathrm{vol}(A)} \\
&\le \frac{\mathrm{vol}(S_{1/8} \setminus A) + \mathrm{vol}(A \setminus S_{1/4})}{E_0}\, O(\Psi) + \frac{E_0}{8\,\mathrm{vol}(A)} \\
&\le \frac{O(\Psi/\alpha) \cdot \mathrm{vol}(A)}{E_0}\, O(\Psi) + \frac{E_0}{8\,\mathrm{vol}(A)},
\end{aligned}$$

where the first inequality uses the definition of $S_{1/4}$, the fifth inequality uses the definition of $S_{1/8}$, and the last inequality uses Lemma 3.4 again. After rearranging the above inequality we conclude that $E_0 \le O\big(\frac{\Psi}{\sqrt{\alpha}}\big)\mathrm{vol}(A)$ and finish the proof. □



The lemma above essentially shows the third property of Theorem 1. For completeness of the paper, we still provide the proof of Theorem 1 in Appendix E.2 and summarize our final algorithm in Algorithm 1.



5 Tightness of Our Analysis 



It is natural to ask, under our newly introduced assumption $\mathsf{Gap} \ge \Omega(1)$: is $O(\sqrt{\Psi/\mathsf{Gap}})$ the best cut conductance we can obtain from a local algorithm? We show that this is true if one sticks to a sweep-cut algorithm using PageRank vectors.
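[Figure 1 (image not reproduced): a top chain from $a$ to $c$ with midpoint $b$, with $\ell + 1$ vertices and $n/\ell$ edges per consecutive pair; a bottom chain from $d$ to $e$, with $\frac{c_0}{\Psi\ell} + 1$ vertices and $\Psi n\ell/c_0$ edges per consecutive pair; $b$ and $d$ joined by $\Psi n$ edges.]

Figure 1: Our hard instance for proving tightness. One can pick for instance $\ell \approx n^{0.4}$ and $\Psi \approx \frac{1}{n^{0.9}}$, so that $n/\ell \approx n^{0.6}$, $\Psi n \approx n^{0.1}$ and $\Psi n \ell \approx n^{0.5}$.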

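Algorithm 1 PageRank-Nibble

Input: $v$, $\Phi$ and $\mathrm{vol}_0 \in \big[\frac{\mathrm{vol}(A)}{2}, \mathrm{vol}(A)\big]$.
Output: set $S$.

1: $\alpha \leftarrow \Theta\big(\frac{\Phi^2}{\log \mathrm{vol}(A)}\big) = \Theta(\Psi \cdot \mathsf{Gap})$.
2: $p \leftarrow$ a $\frac{1}{10 \cdot \mathrm{vol}_0}$-approximate PageRank vector with starting vertex $v$ and teleport probability $\alpha$.
3: Sort all vertices in $\mathrm{supp}(p)$ according to $\frac{p(u)}{\deg(u)}$.
4: Consider all sweep sets $S'_c := \big\{u \in \mathrm{supp}(p) : p(u) \ge \frac{c \deg(u)}{\mathrm{vol}_0}\big\}$ for $c \in \big[\frac{1}{8}, \frac{1}{2}\big]$, and let $S$ be the one among them with the best $\phi_c(S)$.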



More specifically, we show that our analysis in Section 4 is tight by constructing the following hard instance. Consider a (multi-)graph with two chains of vertices (see Figure 1), with multi-edges connecting them.^8 In particular:

• the top chain (ending with vertices $a$ and $c$ and with midpoint $b$) consists of $\ell + 1$ vertices, where $\ell$ is even, with $\frac{n}{\ell}$ edges between each consecutive pair;

• the bottom chain (ending with vertices $d$ and $e$) consists of $\frac{c_0}{\Psi\ell} + 1$ vertices with $\frac{\Psi n \ell}{c_0}$ edges between each consecutive pair, where the constant $c_0$ is to be determined later; and

• vertices $b$ and $d$ are connected with $\Psi n$ edges.

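We let the top chain be our promised cluster $A$. The total volume of $A$ is $2n + \Psi n$, while the total volume of the entire graph is $4n + 2\Psi n$. The mixing time for $A$ is $\tau_{\mathrm{mix}}(A) = \Theta(\ell^2)$, and the cut conductance is $\phi_c(A) = \frac{\Psi n}{\mathrm{vol}(A)} \approx \frac{\Psi}{2}$. Suppose that the gap assumption^9 $\mathsf{Gap} := \frac{1}{\tau_{\mathrm{mix}}(A) \cdot \phi_c(A)} \approx \frac{1}{\Psi \ell^2} \gg 1$ is satisfied, i.e., $\Psi \ell^2 = o(1)$. (For instance one can let $\ell \approx n^{0.4}$ and $\Psi \approx \frac{1}{n^{0.9}}$ to achieve this requirement.)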
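The stated volumes can be sanity-checked numerically. The sketch below is an illustration under assumptions: edge multiplicities are stored as real-valued weights (so they need not be integers), the bottom chain is taken to have $\frac{c_0}{\Psi\ell}$ consecutive pairs (the value forced by the stated total volume $4n + 2\Psi n$), and the names `hard_instance` and `volume`, as well as the concrete parameters, are hypothetical.

```python
def hard_instance(n, ell, psi, c0):
    """Edge multiplicities of the two-chain instance; returns (weights, top, bottom)."""
    top = [("top", i) for i in range(ell + 1)]
    k = round(c0 / (psi * ell))  # consecutive pairs in the bottom chain (assumed count)
    bottom = [("bot", i) for i in range(k + 1)]
    w = {}
    for x, y in zip(top, top[1:]):
        w[(x, y)] = n / ell                      # n / ell parallel edges
    for x, y in zip(bottom, bottom[1:]):
        w[(x, y)] = psi * n * ell / c0           # psi * n * ell / c0 parallel edges
    w[(top[ell // 2], bottom[0])] = psi * n      # midpoint b joined to endpoint d
    return w, top, bottom


def volume(w, S):
    """Sum of degrees (with multiplicity) of the vertices in S."""
    S = set(S)
    return sum(mult * ((x in S) + (y in S)) for (x, y), mult in w.items())


# Illustrative parameters; they are not the asymptotic choices suggested in the text.
n, ell, psi, c0 = 100_000, 100, 1e-4, 1.0
w, top, bottom = hard_instance(n, ell, psi, c0)
vol_A = volume(w, top)
vol_total = volume(w, top + bottom)
phi_A = psi * n / min(vol_A, vol_total - vol_A)  # the only cut edges are the b-d bundle
```

With these parameters $\mathrm{vol}(A) = 2n + \Psi n$ and the total volume is $4n + 2\Psi n$, and $\phi_c(A)$ is close to $\Psi/2$, matching the text.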



^8 One can transform this example into a graph without parallel edges by splitting vertices into expanders, but that is beyond the purpose of this section.

^9 We are using Theorem 1 in the language of the gap assumption on $\tau_{\mathrm{mix}}$. See Section 2.1 and Appendix B for details.
We then consider a PageRank random walk that starts at vertex $v = a$ with teleport probability $\alpha = \frac{\gamma}{\ell^2}$ for some arbitrarily small constant $\gamma > 0$.^10 Let $\mathrm{pr}_a$ be this PageRank vector; we prove in Appendix F the following lemma:

Lemma 5.1. For any $\gamma \in (0,4]$ and letting $\alpha = \gamma/\ell^2$, there exists some constant $c_0$ such that when studying the PageRank vector $\mathrm{pr}_a$ starting from vertex $a$ in Figure 1, the following holds:

$$\frac{\mathrm{pr}_a(d)}{\deg(d)} > \frac{\mathrm{pr}_a(c)}{\deg(c)}.$$

This lemma implies that, for any sweep-cut algorithm based on this vector $\mathrm{pr}_a$, even if it computes $\mathrm{pr}_a$ exactly and examines all possible sweep cuts, none of them gives a better cut conductance than $O(\sqrt{\Psi/\mathsf{Gap}})$. More specifically, for any sweep set $S$:

• if $c \notin S$, then $|E(S, V \setminus S)|$ is at least $\frac{n}{\ell}$ because it has to contain a (multi-)edge in the top chain. Therefore, the cut conductance satisfies $\phi_c(S) \ge \Omega\big(\frac{n}{\ell\,\mathrm{vol}(S)}\big) \ge \Omega\big(\frac{1}{\ell}\big) \ge \Omega(\sqrt{\Psi/\mathsf{Gap}})$; or

• if $c \in S$, then $d$ must also be in $S$ because it has a higher normalized probability than $c$ by Lemma 5.1. In this case, $|E(S, V \setminus S)|$ is at least $\frac{\Psi n \ell}{c_0}$ because it has to contain a (multi-)edge in the bottom chain. Therefore, the cut conductance satisfies $\phi_c(S) \ge \Omega\big(\frac{\Psi n \ell}{\mathrm{vol}(S)}\big) \ge \Omega(\Psi\ell) = \Omega(\sqrt{\Psi/\mathsf{Gap}})$. □

This ends the proof of Theorem 2.



6 Empirical Evaluation 

The PageRank local clustering method has been studied empirically in various previous work. For instance, Gleich and Seshadhri [GS12] performed experiments on 15 datasets and confirmed that PageRank outperformed many other methods in terms of cut conductance, including the well-known Metis algorithm. Moreover, [LLDM09] compared PageRank against Metis+MQI, which is the Metis algorithm plus a flow-based post-processing. Their experiments confirmed that although Metis+MQI outperforms PageRank in terms of cut conductance, the PageRank algorithm's outputs are more "community-like" and enjoy other desirable properties.

Since our PageRank-Nibble is essentially the same PageRank method as before with only theoretical changes in the parameters, it certainly inherits the empirical behavior reported in the literature above. Therefore, in this section we perform experiments only for the sake of demonstrating our theoretical discoveries in Theorem 1, without comparisons to other methods. We run our algorithm against both synthetic and real datasets, and due to the page limit, we defer the details of our experiment setups to Appendix A.

Recall that Theorem 1 has three properties. The first two properties are accuracy guarantees that ensure the output set $S$ well approximates $A$ in terms of volume; the third property is a cut-conductance guarantee that ensures the output set $S$ has small $\phi_c(S)$. We now provide experimental results to support them.

In the first experiment, we study a synthetic random graph of 870 vertices. Our desired cluster $A$ is constructed from the Watts-Strogatz random model with a parameter $\beta \in [0, 1]$ to control the connectivity of $G[A]$: the larger $\beta$ is, the larger Gap is. We therefore present in Figure 2 our experimental results as two curves, both in terms of $\beta$: the cut conductance over $\Psi$ ratio, i.e., $\frac{\phi_c(S)}{\Psi}$, and the clustering accuracy, i.e., $1 - \frac{|A \Delta S|}{|V|}$. Our experiment confirms our result in Theorem 1: PageRank-Nibble indeed performs better both in accuracy and cut conductance as Gap grows.

Although we promised in Theorem 2 to study all starting vertices $v \in A$, in this version of the paper we only concentrate on $v = a$ because other choices of $v$ are only easier and can be analyzed similarly. In addition, this choice of $\alpha = \gamma/\ell^2$ is consistent with the one used in Theorem 1.

Figure 2: Experimental result on the synthetic data. The horizontal axis represents the value of $\beta$ for constructing our graph, the blue curve (left) represents the ratio $\frac{\phi_c(S)}{\Psi}$, and the red curve (right) represents the clustering accuracy. The vertical bars are 94% confidence intervals for 100 runs.



In the second experiment, we use the USPS zipcode dataset that was also used in the work from [WLS+12]. This dataset has 9298 images of handwritten digits from 0 to 9, and we treat them as 10 separate binary-classification problems. We report our results in Table 1. For each of the 10 binary classifications, we have a ground-truth cluster $A$ that contains all data points associated with the given digit. We then compare the cut conductance of our output set, $\phi_c(S)$, against the desired cut conductance $\Psi = \phi_c(A)$, and our algorithm consistently outperforms the desired one on all 10 clusters. (Notice that it is possible for an output set $S$ to have smaller conductance than $A$, because $A$ is not necessarily the sparsest cut in the graph.) In addition, one can also confirm from Table 1 that our algorithm enjoys high precision and recall.

Appendix



A More Details on Experimental Results

In this section we discuss in detail the experiments performed in Section 6.

In the first experiment, we study a synthetic graph of 870 vertices. We carefully choose the parameters as follows in order to confuse the PageRank-Nibble algorithm so that it cannot identify $A$ up to a very high accuracy. We let the vertices be divided into three disjoint subsets: subset $A$ (which is the desired set) of 300 vertices, subset $B$ of 20 vertices and subset $C$ of 550 vertices. We assume that $A$ is constructed from the Watts-Strogatz model with mean degree $K = 60$ and a parameter $\beta \in [0, 1]$: varying $\beta$ makes it possible to interpolate between a regular lattice ($\beta = 0$) that is not well-connected and a random graph ($\beta = 1$) that is well-connected. We then construct the rest of the graph by throwing in random edges, or more specifically, we add an edge

• with probability 0.3 between each pair of vertices in $B$ and $B$;

• with probability 0.02 between each pair of vertices in $C$ and $C$;

• with probability 0.001 between each pair of vertices in $A$ and $B$;

• with probability 0.002 between each pair of vertices in $A$ and $C$; and

• with probability 0.002 between each pair of vertices in $B$ and $C$.

It is not hard to verify that in this randomly generated graph, the (expected) cut conductance $\Psi = \phi_c(A)$ is independent of $\beta$. As a result, the larger $\beta$ is, the better connected $A$ should be, and therefore the larger the gap Gap is in Theorem 1. This should lead to a better performance both in terms of accuracy and conductance as $\beta$ grows.
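The construction above can be sketched in plain Python as follows. This is an illustration only: the exact Watts-Strogatz variant and random number generator used in the experiments are not specified, so the rewiring rule below (rewire one endpoint of each lattice edge with probability $\beta$, avoiding self-loops and duplicates) is an assumption, and the names `watts_strogatz_edges`, `random_pairs` and `synthetic_graph` are hypothetical.

```python
import random


def watts_strogatz_edges(nodes, k, beta, rng):
    """Ring lattice on `nodes` (each joined to its k nearest, k even), then
    rewire the far endpoint of each lattice edge with probability beta."""
    n = len(nodes)
    edges = set()
    for i in range(n):
        for j in range(1, k // 2 + 1):
            edges.add(frozenset((nodes[i], nodes[(i + j) % n])))
    for i in range(n):
        for j in range(1, k // 2 + 1):
            u, w = nodes[i], nodes[(i + j) % n]
            old = frozenset((u, w))
            if old in edges and rng.random() < beta:
                candidates = [x for x in nodes
                              if x != u and frozenset((u, x)) not in edges]
                if candidates:
                    edges.remove(old)
                    edges.add(frozenset((u, rng.choice(candidates))))
    return edges


def random_pairs(X, Y, prob, rng):
    """Add each pair {x, y} independently with probability prob
    (X == Y means pairs inside X)."""
    edges = set()
    same = X == Y
    for i, x in enumerate(X):
        for y in (X[i + 1:] if same else Y):
            if rng.random() < prob:
                edges.add(frozenset((x, y)))
    return edges


def synthetic_graph(beta, seed=0):
    """Return (A, B, C, edges) for the 870-vertex synthetic instance."""
    rng = random.Random(seed)
    A, B, C = list(range(300)), list(range(300, 320)), list(range(320, 870))
    edges = watts_strogatz_edges(A, 60, beta, rng)
    for X, Y, prob in [(B, B, 0.3), (C, C, 0.02), (A, B, 0.001),
                       (A, C, 0.002), (B, C, 0.002)]:
        edges |= random_pairs(X, Y, prob, rng)
    return A, B, C, edges
```

Because rewiring replaces one edge by another, the number of edges inside $A$ stays at $300 \cdot 60 / 2 = 9000$ for every $\beta$, which is consistent with the expected cut conductance being independent of $\beta$.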

To confirm this, we perform an experiment on this randomly generated graph with various choices of $\beta$. For each choice of $\beta$, we run our PageRank-Nibble algorithm with teleport probability $\alpha$ chosen to be the best one in the range $[0.001, 0.3]$, starting vertex $v$ chosen to be a random one in $A$, and $\varepsilon$ sufficiently small. We run our algorithm 100 times, each time against a different random graph instance. We then plot in Figure 2 two curves (along with their 94% confidence intervals) as a function of $\beta$: the average cut conductance over $\Psi$ ratio, i.e., $\frac{\phi_c(S)}{\Psi}$, and the average clustering accuracy, i.e., $1 - \frac{|A \Delta S|}{|V|}$. This figure confirms that as $\beta$ grows (so Gap becomes larger), we have a better performance on both cut conductance and accuracy.
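The accuracy measure plotted in Figure 2 is based on the symmetric difference between the ground-truth set and the output set. A minimal sketch, with the hypothetical function name `clustering_accuracy`, is:

```python
def clustering_accuracy(A, S, num_vertices):
    """Return 1 - |A symmetric-difference S| / |V|."""
    return 1 - len(set(A) ^ set(S)) / num_vertices
```

An output that matches $A$ exactly has accuracy 1, and every vertex that is wrongly included or wrongly excluded lowers the accuracy by $1/|V|$.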

In the second experiment, we use the USPS zipcode data set that was also used in the work from [WLS+12]. Following their experiment, we construct a weighted $k$-NN graph with $k = 20$ out of this data set. The similarity between vertices $i$ and $j$ is computed as $w_{ij} = \exp(-d_{ij}^2/\sigma)$ if $i$ is within $j$'s $k$ nearest neighbors or vice versa, and $w_{ij} = 0$ otherwise, where $\sigma = 0.2 \times r$ and $r$ denotes the average squared distance from each point to its 20th nearest neighbor.
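The weighting rule above can be sketched in plain Python for a small point set. The function name `knn_similarity` and the brute-force neighbor search are illustrative assumptions; the sketch takes $r$ to be the average squared distance to the $k$-th nearest neighbor for the same $k$ used to build the graph.

```python
import math


def knn_similarity(points, k, scale=0.2):
    """Return {frozenset({i, j}): w_ij} for the symmetric k-NN graph."""
    n = len(points)

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    d2 = [[sqdist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Indices of the k nearest neighbors of each point, excluding the point itself.
    nbrs = [sorted((j for j in range(n) if j != i), key=lambda j: d2[i][j])[:k]
            for i in range(n)]
    # r: average squared distance from each point to its k-th nearest neighbor.
    r = sum(d2[i][nbrs[i][-1]] for i in range(n)) / n
    sigma = scale * r
    weights = {}
    for i in range(n):
        for j in nbrs[i]:
            # The edge exists if i is among j's neighbors or vice versa.
            weights[frozenset((i, j))] = math.exp(-d2[i][j] / sigma)
    return weights
```

Pairs that do not appear in the returned dictionary have weight 0.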

This is a dataset with 9298 images of handwritten digits from 0 to 9, and we treat it as 10 separate binary-classification problems. For each of them, we pick an arbitrary starting vertex in it, let $\alpha = 0.003$ and $\varepsilon = 0.00005$, and then run our PageRank-Nibble algorithm. We report our results in Table 1. For each of the 10 binary classifications, we have a ground-truth set $A$ that contains all data points associated with the given digit. We then compare the cut conductance of our output set, $\phi_c(S)$, against the desired cut conductance $\Psi = \phi_c(A)$, and our algorithm consistently outperforms the desired one on all 10 clusters. Notice that in practice it is possible for an output set $S$ to have smaller conductance than $A$, because $A$ is not necessarily the sparsest cut in the graph to begin with. In addition to cut conductance, one can also confirm from Table 1 that our algorithm enjoys high precision and recall.

See http://en.wikipedia.org/wiki/Watts_and_Strogatz_model

http://www-stat.stanford.edu/~tibs/ElemStatLearn/data.html

We emphasize here again that it is not new to notice that PageRank local clustering methods (like ours) perform well in practice, and in many cases better than other approaches; see for instance [GS12, LLDM09]. Our experiments in this section are designed only for the sake of demonstrating our theoretical discoveries in Theorem 1.



B Two Weaker Gap Assumptions

As mentioned in Section 2, we can relax our gap assumption to

$$\mathrm{Gap} \overset{\mathrm{def}}{=} \frac{\mathrm{Conn}(A)}{\Psi} \overset{\mathrm{def}}{=} \frac{\lambda(A)}{\Psi \cdot \log \mathrm{vol}(A)} \ge \Omega(1) \qquad \text{(Gap Assumption')}$$

or

$$\mathrm{Gap} \overset{\mathrm{def}}{=} \frac{\mathrm{Conn}(A)}{\Psi} \overset{\mathrm{def}}{=} \frac{1}{\Psi \cdot \tau_{\mathrm{mix}}(A)} \ge \Omega(1) \qquad \text{(Gap Assumption'')}$$

• Here $\lambda(A)$ is the spectral gap, that is, the difference between the first and second largest eigenvalues of the lazy random walk matrix on $G[A]$. (Notice that the largest eigenvalue of any random walk matrix is always 1.) Equivalently, $\lambda(A)$ can be defined as the second smallest eigenvalue of the Laplacian matrix of $G[A]$.

• Here $\tau_{\mathrm{mix}}$ is the mixing time for the relative pointwise distance in $G[A]$ (cf. Definition 6.14 in [MR95]), that is, the minimum time required for a lazy random walk to mix relatively on all vertices regardless of the starting distribution. Formally, let $W_A$ be the lazy random walk matrix on $G[A]$, and $\pi$ be the stationary distribution on $G[A]$, that is, $\pi(u) = \deg_A(u)/\mathrm{vol}_A(A)$; then

$$\tau_{\mathrm{mix}} = \min\Big\{ t \in \mathbb{Z}_{\ge 0} \;:\; \max_{u,v} \frac{\big|(\chi_v W_A^t)(u) - \pi(u)\big|}{\pi(u)} \le \frac{1}{2} \Big\}.$$

Notice that using Cheeger's inequality, we always have $\frac{\phi_s(A)^2}{\log \mathrm{vol}(A)} \le O\big(\frac{\lambda(A)}{\log \mathrm{vol}(A)}\big) \le O\big(\frac{1}{\tau_{\mathrm{mix}}}\big)$. This is why (Gap Assumption'') is weaker than (Gap Assumption'), which is then weaker than (Gap Assumption).

We emphasize that the exact statement of our Theorem 1 is still true under those two weaker assumptions, leading to two strictly stronger results. To see this, one only needs to restudy Lemma 3.2 under the two new assumptions, and we provide such analysis in Appendix D.2.
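To make the relative pointwise mixing time concrete, the following sketch computes it by brute force on a small graph. The function names are hypothetical, the input is assumed to be a connected simple graph given as an adjacency dictionary, and the default `threshold=0.5` is an assumption of this sketch.

```python
def lazy_walk_matrix(adj):
    """Row-stochastic lazy random walk matrix: stay w.p. 1/2, else move to a uniform neighbor."""
    nodes = sorted(adj)
    idx = {u: i for i, u in enumerate(nodes)}
    n = len(nodes)
    W = [[0.0] * n for _ in range(n)]
    for u in nodes:
        W[idx[u]][idx[u]] = 0.5
        for w in adj[u]:
            W[idx[u]][idx[w]] += 0.5 / len(adj[u])
    return nodes, W


def relative_mixing_time(adj, threshold=0.5, max_steps=10000):
    """Smallest t with max_{u,v} |(chi_v W^t)(u) - pi(u)| / pi(u) <= threshold."""
    nodes, W = lazy_walk_matrix(adj)
    n = len(nodes)
    vol = sum(len(adj[u]) for u in nodes)
    pi = [len(adj[u]) / vol for u in nodes]
    # P holds W^t; row v is the distribution chi_v W^t.
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for t in range(max_steps + 1):
        dist = max(abs(P[v][u] - pi[u]) / pi[u]
                   for v in range(n) for u in range(n))
        if dist <= threshold:
            return t
        P = [[sum(P[v][k] * W[k][u] for k in range(n)) for u in range(n)]
             for v in range(n)]
    return None
```

On the complete graph with 4 vertices the walk mixes in 2 steps under this threshold, whereas a path on 4 vertices needs more, matching the intuition that better-connected sets have smaller $\tau_{\mathrm{mix}}$ and hence larger Gap.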



C Algorithm for Computing Approximate PageRank Vector

In this section we briefly summarize the algorithm Approximate-PR (see Algorithm 2) proposed by Andersen, Chung and Lang [ACL06] to compute an approximate PageRank vector. At a high level, Approximate-PR is an iterative algorithm that maintains the invariant that $p$ is always equal to $pr_{s-r}$ at each iteration.

Initially it lets $p = \vec{0}$ and $r = s$ so that $p = \vec{0} = pr_{\vec{0}}$ satisfies this invariant. Notice that $r$ does not necessarily satisfy $r(u) \le \varepsilon \deg(u)$ for all vertices $u$, and thus this $p$ is often not an $\varepsilon$-approximate PageRank vector according to Definition 2.2 at this initial step.

In each following iteration, Approximate-PR considers a vertex $u$ that violates the $\varepsilon$-approximation of $p$, i.e., $r(u) > \varepsilon \deg(u)$, and pushes this $r(u)$ amount of probability mass elsewhere:

• $\alpha \cdot r(u)$ of it is pushed to $p(u)$;

• $\frac{1-\alpha}{2\deg(u)} r(u)$ of it is pushed to $r(v)$ for each neighbor $v$ of $u$; and

• $\frac{1-\alpha}{2} r(u)$ of it remains at $r(u)$.

One can verify that after any push step the newly computed $p$ and $r$ will still satisfy $p = pr_{s-r}$. This indicates that the invariant is satisfied at all iterations. When Approximate-PR terminates, it satisfies both $p = pr_{s-r}$ and $r(u) \le \varepsilon \deg(u)$ for all vertices $u$, so $p$ must be an $\varepsilon$-approximate PageRank vector.

We are left to show that Approximate-PR terminates quickly, and that the support volume of $p$ is small:

Proposition 2.3. For any starting vector $s$ with $\|s\|_1 \le 1$ and $\varepsilon \in (0, 1]$, Approximate-PR computes an $\varepsilon$-approximate PageRank vector $p = pr_{s-r}$ for some $r$ in time $O\big(\frac{1}{\varepsilon\alpha}\big)$, with $\mathrm{vol}(\mathrm{supp}(p)) \le \frac{2}{(1-\alpha)\varepsilon}$.

Proof sketch. To show that this algorithm converges fast, one just needs to notice that at each iteration $\alpha r(u) \ge \alpha\varepsilon \deg(u)$ amount of probability mass is pushed from vector $r$ to vector $p$, so the total amount of it cannot exceed 1 (because $\|s\|_1 \le 1$). This gives $\sum_{i=1}^{T} \deg(u_i) \le \frac{1}{\varepsilon\alpha}$, where $u_i$ is the vertex chosen at the $i$-th iteration and $T$ is the number of iterations. However, it is not hard to verify that the total running time of Approximate-PR is exactly $O\big(\sum_{i=1}^{T} \deg(u_i)\big)$, and thus Approximate-PR runs in time $O\big(\frac{1}{\varepsilon\alpha}\big)$.

To bound the support volume, we consider an arbitrary vertex $u \in V$ with $p(u) > 0$. This $p(u)$ amount of probability mass must come from $r(u)$ during the algorithm, and thus vertex $u$ must be pushed at least once. Notice that when $u$ is last pushed, it satisfies $r(u) \ge \frac{1-\alpha}{2} \varepsilon \deg(u)$ after the push, and this value $r(u)$ cannot decrease in the remaining iterations of the algorithm. This implies that for all $u \in V$ with $p(u) > 0$, it must be true that $r(u) \ge \frac{1-\alpha}{2} \varepsilon \deg(u)$. However, we must have $\|r\|_1 \le 1$ because $\|s\|_1 \le 1$, so the total volume of such vertices cannot exceed $\frac{2}{(1-\alpha)\varepsilon}$. □

Algorithm 2 Approximate-PR (from [ACL06])

Input: starting vector $s$, teleport probability $\alpha$, and approximate ratio $\varepsilon$.
Output: the $\varepsilon$-approximate PageRank vector $p = pr_{s-r}$.
1: $p \leftarrow \vec{0}$ and $r \leftarrow s$.
2: while $r(u) > \varepsilon \deg(u)$ for some vertex $u \in V$ do
3:   Pick an arbitrary $u$ satisfying $r(u) > \varepsilon \deg(u)$.
4:   $p(u) \leftarrow p(u) + \alpha r(u)$.
5:   For each vertex $v$ such that $(u, v) \in E$: $r(v) \leftarrow r(v) + \frac{1-\alpha}{2\deg(u)} r(u)$.
6:   $r(u) \leftarrow \frac{1-\alpha}{2} r(u)$.
7: end while
8: return $p$.
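Algorithm 2 translates almost line by line into code. The sketch below assumes an unweighted simple graph given as an adjacency dictionary and uses a queue to find violating vertices; the function name `approximate_pr` and the queue-based scheduling are illustrative choices (the algorithm allows any order of pushes).

```python
from collections import deque


def approximate_pr(adj, s, alpha, eps):
    """Push-based approximate PageRank; returns (p, r) with r(u) <= eps * deg(u) for all u."""
    p = {}
    r = dict(s)
    queue = deque(u for u in r if r[u] > eps * len(adj[u]))
    queued = set(queue)
    while queue:
        u = queue.popleft()
        queued.discard(u)
        ru = r.get(u, 0.0)
        deg_u = len(adj[u])
        if ru <= eps * deg_u:
            continue
        # Push: alpha * r(u) goes to p(u), half of the rest stays at r(u),
        # and the other half is spread evenly over the neighbors of u.
        p[u] = p.get(u, 0.0) + alpha * ru
        share = (1 - alpha) * ru / (2 * deg_u)
        r[u] = (1 - alpha) * ru / 2
        for v in adj[u]:
            r[v] = r.get(v, 0.0) + share
            if r[v] > eps * len(adj[v]) and v not in queued:
                queue.append(v)
                queued.add(v)
        if r[u] > eps * deg_u and u not in queued:
            queue.append(u)
            queued.add(u)
    return p, r
```

Each push conserves the total mass $\|p\|_1 + \|r\|_1$, which is the bookkeeping fact behind the running-time argument in the proof sketch.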



D Missing Proofs in Section 3

D.1 Proof of Lemma 3.1

Lemma 3.1. There exists a set $A^g \subseteq A$ with volume $\mathrm{vol}(A^g) \ge \frac{1}{2}\mathrm{vol}(A)$ such that, for any vertex $v \in A^g$, in a PageRank vector with teleport probability $\alpha$ starting at $v$, we have:

$$\sum_{u \notin A} pr_v(u) \le \frac{2\Psi}{\alpha}. \qquad (3.1)$$

In addition, there exists a non-negative leakage vector $l \in [0,1]^V$ with norm $\|l\|_1 \le \frac{2\Psi}{\alpha}$ satisfying

$$\forall u \in A, \quad pr_v(u) \ge \widetilde{pr}_v(u) - \widetilde{pr}_l(u). \qquad (3.2)$$

Leakage event. We begin our proof by defining the leakage event in a random walk procedure. We start with the definition for a lazy random walk and then move to a PageRank random walk. At a high level, we say that a lazy random walk of length $t$ starting at a vertex $u \in A$ does not leak from $A$ if it never goes out of $A$, and let $\mathrm{Leak}(u, t)$ denote the probability that such a random walk leaks.

More formally, for each vertex $u \in V$ in the graph with degree $\deg(u)$, recall that in its lazy random walk graph it actually has degree $2\deg(u)$, with $\deg(u)$ edges going to its neighbors and $\deg(u)$ self-loops. For a vertex $u \in A$, let us call its neighboring edge $(u, v) \in E$ a bad edge if $v \notin A$. In addition, if $u$ has $k$ bad edges, we also distinguish $k$ self-loops at $u$ in the lazy random walk graph, and call them bad self-loops. Now, we say that a random walk does not leak from $A$ if it never uses any of those bad edges or bad self-loops. The purpose of this definition is to make sure that if a random walk chooses only good edges at each step, it is equivalent to a lazy random walk on the induced subgraph $G[A]$ with outgoing edges removed.

For a PageRank random walk with teleport probability $\alpha$ starting at a vertex $u$, recall that it is also a random procedure and can be viewed as first picking a length $t \in \{0, 1, \ldots\}$ with probability $\alpha(1-\alpha)^t$, and then performing a lazy random walk of length $t$ starting from $u$. By the linearity of random walk vectors, the probability of leakage for this PageRank random walk is exactly $\sum_{t=0}^{\infty} \alpha(1-\alpha)^t \mathrm{Leak}(u, t)$.

Upper bounding leakage. We now give an upper bound on the probability of leakage. We start with an auxiliary lazy random walk of length $t$ starting from a "uniform" distribution $\pi_A(u)$. Recall that $\pi_A(u) = \frac{\deg(u)}{\mathrm{vol}(A)}$ for $u \in A$ and 0 elsewhere. We now want to show that this random walk leaks with probability at most $t\Psi$. This is because one can verify that: (1) in the first step of this random walk, the probability of leakage is upper bounded by $\Psi$ by the definition of cut conductance; and (2) in the $i$-th step in general, this random walk satisfies $(\pi_A W^{i-1})(u) \le \pi_A(u)$ for any vertex $u \in A$, and therefore the probability of leakage in the $i$-th step is upper bounded by that in the first step. In sum, the total leakage is at most $t\Psi$, or equivalently, $\sum_{u \in A} \pi_A(u) \mathrm{Leak}(u, t) \le t\Psi$.

We now sum this up over the distribution of $t$ in a PageRank random walk:

$$\sum_{u \in A} \pi_A(u) \Big( \sum_{t=0}^{\infty} \alpha(1-\alpha)^t \mathrm{Leak}(u, t) \Big) = \sum_{t=0}^{\infty} \alpha(1-\alpha)^t \Big( \sum_{u \in A} \pi_A(u) \mathrm{Leak}(u, t) \Big) \le \sum_{t=0}^{\infty} \alpha(1-\alpha)^t \, t\Psi = \frac{\Psi(1-\alpha)}{\alpha}.$$

This implies, using the Markov bound, that there exists a set $A^g \subseteq A$ with volume $\mathrm{vol}(A^g) \ge \frac{1}{2}\mathrm{vol}(A)$ satisfying

$$\forall v \in A^g, \quad \sum_{t=0}^{\infty} \alpha(1-\alpha)^t \mathrm{Leak}(v, t) \le \frac{2\Psi(1-\alpha)}{\alpha} \le \frac{2\Psi}{\alpha}, \qquad (D.1)$$

or in words: the probability of leakage is at most $\frac{2\Psi(1-\alpha)}{\alpha}$ in a PageRank random walk that starts at vertex $v \in A^g$. This inequality immediately implies (3.1), so for the rest of the proof we concentrate on (3.2).

Note that this step of the proof coincides with that of Proposition 2.5 from [ST08]. Our $t\Psi$ is off by a factor of 2 from theirs because we also regard bad self-loops as edges that leak.

Lower bounding pr. Now we pick some $v \in A^g$, and try to lower bound $pr_v$. To begin with, we define two $|A| \times |A|$ lazy random walk matrices on the induced subgraph $G[A]$ (recall that $\deg(u)$ is the degree of a vertex and for $u \in A$ we denote by $\deg_A(u)$ the number of neighbors of $u$ inside $A$):

1. Matrix $\widehat{W}$. This is a random walk matrix assuming that all outgoing edges from $A$ are "phantom", that is, at each vertex $u \in A$:

• it picks each neighbor in $A$ with probability $\frac{1}{2\deg(u)}$, and

• it stays where it is with probability $\frac{\deg_A(u)}{2\deg(u)}$.

For instance, let $u$ be a vertex in $A$ with four neighbors $w_1, w_2, w_3, w_4$ such that $w_1, w_2, w_3 \in A$ but $w_4 \notin A$. Then, for a lazy random walk using matrix $\widehat{W}$, if it starts from $u$ then in the next step it stays at $u$ with probability 3/8, and goes to $w_1$, $w_2$ and $w_3$ each with probability 1/8. Note that, for the remaining 1/4 probability (which corresponds to $w_4$) it goes nowhere and this random walk "disappears"! This can be viewed as the random walk leaking from $A$.

2. Matrix $\widetilde{W}$. This is a random walk matrix assuming that all outgoing edges from $A$ are removed, that is, at each vertex $u \in A$:

• it picks each neighbor in $A$ with probability $\frac{1}{2\deg_A(u)}$, and

• it stays where it is with probability $\frac{1}{2}$.

The major difference between $\widehat{W}$ and $\widetilde{W}$ is that they are normalized by different degrees in the rows: the rows of $\widetilde{W}$ sum up to 1 but those of $\widehat{W}$ do not necessarily. More specifically, if we denote by $D$ the diagonal matrix with $\deg(u)$ on the diagonal for each vertex $u \in A$, and by $D_A$ the diagonal matrix with $\deg_A(u)$ on the diagonal, then $\widehat{W} = D^{-1} D_A \widetilde{W}$. It is worth noting that, if one sums up all entries of the nonnegative vector $\chi_v \widehat{W}^t$, the summation is exactly $1 - \mathrm{Leak}(v, t)$ by our definition of Leak.
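The two matrices can be built explicitly for a small example. The sketch below uses exact fractions and a sparse dictionary-of-rows representation; the function name `phantom_and_induced_walks` is hypothetical, and the code assumes every vertex of $A$ has at least one neighbor inside $A$.

```python
from fractions import Fraction


def phantom_and_induced_walks(adj, A):
    """Rows of the 'phantom' matrix W_hat and the induced matrix W_tilde on G[A]."""
    inside = set(A)
    W_hat, W_tilde = {}, {}
    for u in sorted(inside):
        deg = len(adj[u])
        nbrs_in = [w for w in adj[u] if w in inside]
        deg_A = len(nbrs_in)  # assumed to be at least 1
        hat = {w: Fraction(1, 2 * deg) for w in nbrs_in}
        hat[u] = Fraction(deg_A, 2 * deg)
        tilde = {w: Fraction(1, 2 * deg_A) for w in nbrs_in}
        tilde[u] = Fraction(1, 2)
        W_hat[u], W_tilde[u] = hat, tilde
    return W_hat, W_tilde
```

For the four-neighbor example in the text (three neighbors inside $A$), the row of `W_hat` at $u$ sums to 3/4, so 1/4 of the probability leaks in one step, while the row of `W_tilde` sums to 1.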

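We now precisely study the difference between $\widehat{W}$ and $\widetilde{W}$ using the following claim.

Claim D.1. There exist non-negative vectors $l_t$ for all $t \in \{1, 2, \ldots\}$ satisfying

$$\|l_t\|_1 = \mathrm{Leak}(v, t) - \mathrm{Leak}(v, t-1),$$

and

$$\chi_v \widehat{W}^t = \big(\chi_v \widehat{W}^{t-1} - l_t\big) \widetilde{W}.$$

Proof. To obtain the result of this claim, we write

$$\chi_v \widehat{W}^t = \big(\chi_v \widehat{W}^{t-1}\big) D^{-1} D_A \widetilde{W} = \big(\chi_v \widehat{W}^{t-1}\big) \widetilde{W} - \big(\chi_v \widehat{W}^{t-1}\big) \big(I - D^{-1} D_A\big) \widetilde{W}.$$

Now, we simply let $l_t = \big(\chi_v \widehat{W}^{t-1}\big)\big(I - D^{-1} D_A\big)$. It is a non-negative vector because $\deg_A(u)$ is no larger than $\deg(u)$ for all $u \in A$. Furthermore, recall that in the lazy random walk characterized by $\widehat{W}$, the amount of probability that disappears at a vertex $u$ in the $t$-th step is exactly its probability after a $(t-1)$-step random walk, i.e., $(\chi_v \widehat{W}^{t-1})(u)$, multiplied by the probability to leak in this step, i.e., $1 - \frac{\deg_A(u)}{\deg(u)}$. Therefore, $l_t(u)$ exactly equals the amount of probability that disappears in the $t$-th step; or equivalently, $\|l_t\|_1 = \mathrm{Leak}(v, t) - \mathrm{Leak}(v, t-1)$. □

Now we use the above definition of $l_t$ and deduce that:

Claim D.2. Letting $l = \sum_{j=1}^{\infty} (1-\alpha)^{j-1} l_j$, we have $\|l\|_1 \le \frac{2\Psi}{\alpha}$ and the following vector inequality holds coordinate-wise on all vertices in $A$:

$$pr_v|_A \ge \sum_{t=0}^{\infty} \alpha(1-\alpha)^t (\chi_v - l) \widetilde{W}^t = \widetilde{pr}_v - \widetilde{pr}_l.$$

Proof. We begin the proof with a simple observation. The following vector inequality holds coordinate-wise on all vertices in $A$ according to the definition of $\widehat{W}$:

$$pr_v|_A = \sum_{t=0}^{\infty} \alpha(1-\alpha)^t (\chi_v W^t)|_A \ge \sum_{t=0}^{\infty} \alpha(1-\alpha)^t \chi_v \widehat{W}^t.$$

Therefore, to lower bound $pr_v|_A$, it suffices to lower bound the right-hand side. Now owing to Claim D.1 we further reduce the computation on matrix $\widehat{W}$ to that on matrix $\widetilde{W}$:

$$\chi_v \widehat{W}^t = \big(\chi_v \widehat{W}^{t-1} - l_t\big)\widetilde{W} = \Big(\big(\chi_v \widehat{W}^{t-2} - l_{t-1}\big)\widetilde{W} - l_t\Big)\widetilde{W} = \cdots = \chi_v \widetilde{W}^t - \sum_{j=1}^{t} l_j \widetilde{W}^{t-j+1}.$$

We next combine the above two inequalities and compute

$$\begin{aligned}
pr_v|_A &\ge \sum_{t=0}^{\infty} \alpha(1-\alpha)^t \chi_v \widehat{W}^t = \sum_{t=0}^{\infty} \alpha(1-\alpha)^t \Big( \chi_v \widetilde{W}^t - \sum_{j=1}^{t} l_j \widetilde{W}^{t-j+1} \Big) \\
&= \sum_{t=0}^{\infty} \alpha(1-\alpha)^t \chi_v \widetilde{W}^t - \sum_{t=1}^{\infty} \alpha(1-\alpha)^t \sum_{j=1}^{t} l_j \widetilde{W}^{t-j+1} \\
&= \sum_{t=0}^{\infty} \alpha(1-\alpha)^t \chi_v \widetilde{W}^t - \sum_{j=1}^{\infty} (1-\alpha)^{j-1} l_j \sum_{t=1}^{\infty} \alpha(1-\alpha)^t \widetilde{W}^t \\
&\ge \sum_{t=0}^{\infty} \alpha(1-\alpha)^t \chi_v \widetilde{W}^t - \sum_{j=1}^{\infty} (1-\alpha)^{j-1} l_j \sum_{t=0}^{\infty} \alpha(1-\alpha)^t \widetilde{W}^t \\
&= \sum_{t=0}^{\infty} \alpha(1-\alpha)^t \Big( \chi_v - \sum_{j=1}^{\infty} (1-\alpha)^{j-1} l_j \Big) \widetilde{W}^t = \sum_{t=0}^{\infty} \alpha(1-\alpha)^t (\chi_v - l) \widetilde{W}^t.
\end{aligned}$$

At last, we upper bound the one norm of $l$ using Claim D.1 again:

$$\|l\|_1 = \sum_{j=1}^{\infty} (1-\alpha)^{j-1} \|l_j\|_1 = \sum_{j=1}^{\infty} (1-\alpha)^{j-1} \big( \mathrm{Leak}(v, j) - \mathrm{Leak}(v, j-1) \big) = \sum_{j=1}^{\infty} \alpha (1-\alpha)^{j-1} \mathrm{Leak}(v, j) \le \frac{2\Psi(1-\alpha)}{\alpha(1-\alpha)} = \frac{2\Psi}{\alpha},$$

where the last inequality uses (D.1). □

So far we have also shown (3.2), and this ends the proof of Lemma 3.1.

D.2 Proof of Lemma 3.2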

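Lemma 3.2 (restated). When the teleport probability $\alpha \le \frac{\Phi^2}{72(3 + \log \mathrm{vol}(A))}$ (or more weakly when $\alpha \le \frac{\lambda(A)}{9(3 + \log \mathrm{vol}(A))}$, or $\alpha \le O\big(\frac{1}{\tau_{\mathrm{mix}}}\big)$), we have that

$$\forall u \in A, \quad \widetilde{pr}_v(u) = \sum_{t=0}^{\infty} \alpha(1-\alpha)^t \big(\chi_v \widetilde{W}^t\big)(u) \ge \frac{4}{5} \cdot \frac{\deg_A(u)}{\mathrm{vol}(A)}.$$

Proof. We first prove this lemma in the case when $\alpha \le \frac{\Phi^2}{72(3 + \log \mathrm{vol}(A))}$ or $\alpha \le \frac{\lambda(A)}{9(3 + \log \mathrm{vol}(A))}$. We will then extend it to the weakest assumption $\alpha \le O\big(\frac{1}{\tau_{\mathrm{mix}}}\big)$. For a discussion on the comparisons between those three assumptions, see Appendix B.

Recall that we defined $\widetilde{W}$ to be the lazy random walk matrix on $A$ with outgoing edges removed, and denoted by $\lambda = \lambda(A)$ the spectral gap of the lazy random walk matrix of $G[A]$ (cf. Appendix B). Then, by the theory of infinity-norm mixing time of a Markov chain, the length-$t$ random walk starting at any vertex $v \in A$ will land in a vertex $u \in A$ with probability:

$$\big(\chi_v \widetilde{W}^t\big)(u) \ge \frac{\deg_A(u)}{\sum_{y \in A} \deg_A(y)} - (1-\lambda)^t \frac{\deg_A(u)}{\min_y \deg_A(y)}.$$

Now if we choose $T_0 = \frac{3 + \log \mathrm{vol}(A)}{\lambda}$, then for any $t \ge T_0$:

$$\big(\chi_v \widetilde{W}^t\big)(u) \ge \frac{\deg_A(u)}{\sum_{y \in A} \deg_A(y)} - (1-\lambda)^t \deg_A(u) \ge \frac{9}{10} \cdot \frac{\deg_A(u)}{\sum_{y \in A} \deg_A(y)} \ge \frac{9}{10} \cdot \frac{\deg_A(u)}{\mathrm{vol}(A)}. \qquad (D.2)$$

We then convert this into the language of PageRank vectors:

$$\sum_{t=0}^{\infty} \alpha(1-\alpha)^t \big(\chi_v \widetilde{W}^t\big)(u) \ge (1-\alpha)^{T_0} \sum_{t=0}^{\infty} \alpha(1-\alpha)^t \big(\chi_v \widetilde{W}^{t+T_0}\big)(u) \ge (1-\alpha)^{T_0} \sum_{t=0}^{\infty} \alpha(1-\alpha)^t \cdot \frac{9}{10} \cdot \frac{\deg_A(u)}{\mathrm{vol}(A)} = (1-\alpha)^{T_0} \cdot \frac{9}{10} \cdot \frac{\deg_A(u)}{\mathrm{vol}(A)}.$$

At last, we notice that $\alpha \le \frac{1}{9 T_0}$ holds: this is either because we have chosen $\alpha \le \frac{\lambda}{9(3 + \log \mathrm{vol}(A))}$, or because we have chosen $\alpha \le \frac{\Phi^2}{72(3 + \log \mathrm{vol}(A))}$ and Cheeger's inequality $\lambda \ge \Phi^2/8$ holds. As a consequence, it satisfies that $(1-\alpha)^{T_0} \ge 1 - \alpha T_0 \ge \frac{8}{9}$ and thus $(1-\alpha)^{T_0} \cdot \frac{9}{10} \cdot \frac{\deg_A(u)}{\mathrm{vol}(A)} \ge \frac{4}{5} \cdot \frac{\deg_A(u)}{\mathrm{vol}(A)}$.

We can also show our lemma under the assumption that $\alpha \le O(1/\tau_{\mathrm{mix}})$. In such a case, one can choose $T_0 = \Theta(\tau_{\mathrm{mix}})$ so that (D.2) and the rest of the proof still hold. It is worth emphasizing that since we always have $\frac{\Phi^2}{\log \mathrm{vol}(A)} \le O\big(\frac{\lambda(A)}{\log \mathrm{vol}(A)}\big) \le O\big(\frac{1}{\tau_{\mathrm{mix}}}\big)$, this last assumption is the weakest one among all three. □

Here we have used the fact that $\min_y \deg_A(y) \ge 1$. This is because otherwise $G[A]$ will be disconnected so that $\Phi = \phi_s(A) = 0$, $\lambda(A) = 0$ and $\tau_{\mathrm{mix}}(A) = \infty$, but none of the three can happen under our gap assumption $\mathrm{Gap} \ge \Omega(1)$.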
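The constants in this proof can be checked numerically. The sketch below takes a spectral gap `lam` and a volume `vol_A` (both sample inputs, not values from the paper), sets $T_0 = (3 + \log \mathrm{vol}(A))/\lambda$ and $\alpha = 1/(9 T_0)$, and returns the two quantities that the proof bounds; the function name `lemma32_constants` is hypothetical.

```python
import math


def lemma32_constants(lam, vol_A):
    """Return (T0, decay, alpha, keep) for the constants used in the proof."""
    T0 = (3 + math.log(vol_A)) / lam
    decay = (1 - lam) ** T0      # at most exp(-3) / vol_A, hence at most 1 / (10 vol_A)
    alpha = 1 / (9 * T0)         # largest teleport probability allowed by alpha <= 1/(9 T0)
    keep = (1 - alpha) ** T0     # at least 1 - alpha * T0 = 8/9
    return T0, decay, alpha, keep
```

With `keep >= 8/9`, the product `keep * 9/10` is at least 4/5, which is the constant in the lemma.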
E Missing Proofs in Section 4

E.1 Proof of Lemma 4.1

Lemma 4.1. Letting $\alpha = \Theta(\Psi \cdot \mathrm{Gap})$, among all sweep sets $S_c = \{u \in V : p(u) \ge c \frac{\deg(u)}{\mathrm{vol}(A)}\}$ for $c \in [\frac{1}{8}, \frac{1}{4}]$, there exists one, denoted by $S_{c^*}$, with cut conductance $\phi_c(S_{c^*}) = O(\sqrt{\Psi/\mathrm{Gap}})$.

Proof. We only point out how to extend our proof in the exact case (see Section 4) to the case when $p$ is an $\varepsilon$-approximate PageRank vector. For any set $S_{1/4} \subseteq S \subseteq S_{1/8}$, we compute that

$$\begin{aligned}
p(S) = pr_{\chi_v - r}(S) &= \alpha(\chi_v - r)(S) + (1-\alpha)(pW)(S) \\
&= \alpha(\chi_v - r)(V) + \alpha r(V \setminus S) + (1-\alpha)(pW)(S) \\
&\le \alpha(\chi_v - r)(V) + \alpha\big(r(V \setminus A) + r(A \setminus S)\big) + (1-\alpha)(pW)(S) \\
&= \alpha p(V) + \alpha\big(r(V \setminus A) + r(A \setminus S)\big) + (1-\alpha)(pW)(S),
\end{aligned}$$

where in the last equality we have used $(\chi_v - r)(V) = p(V)$, owing to the fact that $p = (\chi_v - r)\sum_{t=0}^{\infty} \alpha(1-\alpha)^t W^t$ and $W$ is a random walk matrix that preserves the total probability mass.

We next notice that $r(V \setminus A) \le \frac{2\Psi}{\alpha}$ according to Corollary 3.3, as well as $r(A \setminus S) \le \varepsilon \, \mathrm{vol}(A \setminus S)$ according to Definition 2.2. Since $S \supseteq S_{1/4}$, combining the latter with Lemma 3.4, our choice of $\alpha$ from Appendix D.2, and our choice of $\varepsilon \le \frac{1}{10 \mathrm{vol}(A)}$ in Section 3.2 gives $r(A \setminus S) \le \frac{0.7\Psi}{\alpha}$.

Therefore, we have

$$\begin{aligned}
p(S) &\le \alpha p(V) + \alpha\Big(\frac{2\Psi}{\alpha} + \frac{0.7\Psi}{\alpha}\Big) + (1-\alpha)(pW)(S) = \alpha p(V) + 2.7\Psi + (1-\alpha)(pW)(S) \\
\Longleftrightarrow \quad (1-\alpha) p(S) &\le \alpha \cdot p(V \setminus S) + 2.7\Psi + (1-\alpha)(pW)(S) \\
\Longrightarrow \quad (1-\alpha) p(S) &\le 4.7\Psi + (1-\alpha)(pW)(S) \qquad \text{(using Corollary 3.3)} \\
\Longrightarrow \quad p(S) &\le 5.3\Psi + (pW)(S) \qquad \text{(using } \alpha \le \tfrac{1}{9} \text{ again)}.
\end{aligned}$$

In sum, we have arrived at the same conclusion as (4.1) in the case when $p$ is only approximate, and the rest of the proof follows in the same way as in the exact case. □

E.2 Proof of Theorem 1

We are ready to put together all previous lemmas to show the main theorem of this paper.

Theorem 1. If there exists a non-empty set $A \subset V$ such that $\phi_c(A) \le \Psi$ and $\mathrm{Gap} \ge \Omega(1)$, then there exists some $A^g \subseteq A$ with $\mathrm{vol}(A^g) \ge \frac{1}{2}\mathrm{vol}(A)$ such that, when choosing a starting vertex $v \in A^g$, the PageRank-Nibble algorithm outputs a set $S$ with

1. $\mathrm{vol}(S \setminus A) \le O\big(\frac{1}{\mathrm{Gap}}\big) \mathrm{vol}(A)$,

2. $\mathrm{vol}(A \setminus S) \le O\big(\frac{1}{\mathrm{Gap}}\big) \mathrm{vol}(A)$,

3. $\phi_c(S) \le O\big(\sqrt{\Psi/\mathrm{Gap}}\big)$,

and with running time $O\big(\frac{\mathrm{vol}(A)}{\Psi \cdot \mathrm{Gap}}\big) \le O\big(\frac{\mathrm{vol}(A)}{\Psi}\big)$.

Proof. As in Algorithm 1, we choose $\alpha = \Theta(\Psi \cdot \mathrm{Gap})$ to satisfy the requirements of all previous lemmas. We define $A^g$ according to Lemma 3.1 and compute an $\varepsilon$-approximate PageRank vector starting from $v$, where $\varepsilon = \frac{1}{10 \mathrm{vol}_0}$ satisfies (3.3).

Next we study all sweep sets $S'_c = \{u \in \mathrm{supp}(p) : p(u) \ge \frac{c \deg(u)}{\mathrm{vol}_0}\}$ for $c \in [\frac{1}{8}, \frac{1}{2}]$. Since $\mathrm{vol}_0$ is within a factor of 2 of $\mathrm{vol}(A)$, all such sweep sets correspond to $S_d = \{u \in \mathrm{supp}(p) : p(u) \ge \frac{d \deg(u)}{\mathrm{vol}(A)}\}$ for some $d \in [\frac{1}{16}, \frac{1}{2}]$. Therefore, the output $S$ is also some sweep set $S_d$ with $d \in [\frac{1}{16}, \frac{1}{2}]$. Notice that Lemma 3.4 guarantees the first two properties of the theorem.

On the other hand, Lemma 4.1 guarantees the existence of some sweep set $S_{d^*}$ satisfying $\phi_c(S_{d^*}) = O(\sqrt{\Psi/\mathrm{Gap}})$. Since $d^* \in [\frac{1}{8}, \frac{1}{4}]$, this $S_{d^*}$ is also a sweep set $S'_c$ with $c \in [\frac{1}{8}, \frac{1}{2}]$, and thus must be considered as a sweep set candidate in our Algorithm 1. This immediately implies that the output $S$ of Algorithm 1 must have a cut conductance $\phi_c(S)$ that is at least as good as $\phi_c(S_{d^*}) = O(\sqrt{\Psi/\mathrm{Gap}})$, finishing the proof for the third property of the theorem.

At last, as a direct consequence of Proposition 2.3 and the fact that the computation of the approximate PageRank vector is the bottleneck for the running time, we conclude that Algorithm 1 runs in time $O\big(\frac{\mathrm{vol}(A)}{\alpha}\big) = O\big(\frac{\mathrm{vol}(A)}{\Psi \cdot \mathrm{Gap}}\big)$. □



F Missing Proofs in Section 5

In this section we show that our cut conductance analysis for Theorem 1 is tight. We emphasize here that such a tightness proof is highly non-trivial, because one has to provide a hard graph instance and then upper and lower bound the probabilities of reaching specific vertices up to a very high precision. This is different from the mixing time theory on Markov chains: for instance, on a chain of $\ell$ vertices it is known that a random walk of $O(\ell^2)$ steps mixes, but in addition we need to compute how much faster it mixes on one vertex than on another.

In Appendix F.1 we begin with some warm-up lemmas for the PageRank vector on a single chain, and then in Appendix F.2 we formally prove Lemma 5.1 with the help of those lemmas.

F.1 Useful Lemmas for a PageRank Random Walk on a Chain

In this subsection we provide four useful lemmas about a PageRank random walk on a single chain. For instance, in the first of them we study a chain of length $\ell$ and compute an upper bound on the probability to reach the rightmost vertex from the leftmost one. The other three lemmas are similar in this format. Those lemmas require the study of the eigensystem of a lazy random walk matrix on this chain, followed by very careful but problem-specific analyses.

Lemma F.1. Let $\ell$ be an even integer, and consider a chain of $\ell + 1$ vertices with the leftmost vertex indexed by 0 and the rightmost vertex indexed by $\ell$. Let $pr_{\chi_0}$ be the PageRank vector for a random walk starting at vertex 0 with teleport probability $\alpha = \gamma/\ell^2$ for some constant $\gamma$. Then,

$$pr_{\chi_0}(\ell) \le \frac{1}{2\ell}\Big(1 - \frac{2\gamma}{\pi^2/4 + \gamma} + \frac{2\gamma}{\pi^2 + \gamma} + O\Big(\frac{1}{\ell^2}\Big)\Big).$$

Proof. Let us define

$$W = \begin{pmatrix}
1/2 & 1/2 & & & \\
1/4 & 1/2 & 1/4 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1/4 & 1/2 & 1/4 \\
 & & & 1/2 & 1/2
\end{pmatrix}$$

to be the $(\ell + 1) \times (\ell + 1)$ lazy random walk matrix of our chain. For $k = 0, 1, \ldots, \ell$, define:

$$\lambda_k = \frac{1 + \cos\big(\frac{\pi k}{\ell}\big)}{2} = \cos^2\Big(\frac{\pi k}{2\ell}\Big), \qquad v_k(u) = \deg(u) \cdot \cos\Big(\frac{\pi k u}{\ell}\Big) \quad (u = 0, 1, \ldots, \ell), \qquad (F.1)$$

where $\deg(u)$ is the degree of the $u$-th vertex, that is, $\deg(0) = \deg(\ell) = 1$ while $\deg(i) = 2$ for $i \in \{1, 2, \ldots, \ell - 1\}$. Then it is routine to verify that $v_k \cdot W = \lambda_k \cdot v_k$, and thus

$v_k$ is the $k$-th (left-)eigenvector and $\lambda_k$ is the $k$-th eigenvalue for matrix $W$.

We remark here that since $W$ is not symmetric, those eigenvectors are not orthogonal to each other under the standard inner product. However, under the inner product $\langle x, y \rangle = \sum_{i=0}^{\ell} x(i) y(i) \deg(i)^{-1}$, they form an orthogonal basis.
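The eigenpairs in (F.1) can be verified numerically on a small chain. The sketch below is an illustration with hypothetical function names; it builds the lazy walk matrix of the chain, forms $\lambda_k$ and $v_k$, and provides the weighted inner product from the text.

```python
import math


def chain_lazy_walk(ell):
    """Lazy random walk matrix of a chain with vertices 0..ell."""
    n = ell + 1
    W = [[0.0] * n for _ in range(n)]
    for u in range(n):
        W[u][u] = 0.5
        nbrs = [w for w in (u - 1, u + 1) if 0 <= w <= ell]
        for w in nbrs:
            W[u][w] = 0.5 / len(nbrs)
    return W


def eigenpair(ell, k):
    """Return (lambda_k, v_k, deg) as defined in (F.1)."""
    deg = [1 if u in (0, ell) else 2 for u in range(ell + 1)]
    lam = math.cos(math.pi * k / (2 * ell)) ** 2
    v = [deg[u] * math.cos(math.pi * k * u / ell) for u in range(ell + 1)]
    return lam, v, deg


def left_multiply(v, W):
    """Row vector times matrix."""
    n = len(v)
    return [sum(v[w] * W[w][u] for w in range(n)) for u in range(n)]


def weighted_inner(x, y, deg):
    """Inner product <x, y> = sum_i x(i) y(i) / deg(i)."""
    return sum(a * b / d for a, b, d in zip(x, y, deg))
```

For $0 < k < \ell$ the weighted norm $\langle v_k, v_k \rangle$ equals $\ell$, and it equals $2\ell$ for $k \in \{0, \ell\}$, which is where the coefficients in the expansion of $\chi_0$ come from.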

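We now expand our starting probability vector $\chi_0$ under this basis:

$$\chi_0 = (1, 0, 0, \ldots, 0) = \frac{1}{2\ell}\Big(v_0 + 2\sum_{k=1}^{\ell-1} v_k + v_\ell\Big).$$

As a consequence, when $t > 0$, using $\lambda_\ell = 0$:

$$\chi_0 W^t = \frac{1}{2\ell}\Big(v_0 + 2\sum_{k=1}^{\ell-1} (\lambda_k)^t v_k\Big).$$

Now it is easy to compute the exact probability of reaching the rightmost vertex $\ell$:

$$\begin{aligned}
(\chi_0 W^t)(\ell) &= \frac{1}{2\ell}\Big(v_0(\ell) + 2\sum_{k=1}^{\ell-1} (\lambda_k)^t v_k(\ell)\Big) = \frac{1}{2\ell}\Big(1 + 2\sum_{k=1}^{\ell-1} \cos^{2t}\Big(\frac{\pi k}{2\ell}\Big)\cos(\pi k)\Big) \\
&= \frac{1}{2\ell}\Big(1 + 2\sum_{k=1}^{\ell-1} \cos^{2t}\Big(\frac{\pi k}{2\ell}\Big)(-1)^k\Big) \le \frac{1}{2\ell}\Big(1 - 2\cos^{2t}\Big(\frac{\pi}{2\ell}\Big) + 2\cos^{2t}\Big(\frac{\pi}{\ell}\Big)\Big).
\end{aligned}$$

At last, we translate this language into the PageRank vector $pr_{\chi_0}$ and obtain (the bound above also holds trivially for $t = 0$, since $\chi_0(\ell) = 0$):

$$\begin{aligned}
pr_{\chi_0}(\ell) = \sum_{t=0}^{\infty} \alpha(1-\alpha)^t (\chi_0 W^t)(\ell) &\le \frac{1}{2\ell}\sum_{t=0}^{\infty} \alpha(1-\alpha)^t\Big(1 - 2\cos^{2t}\Big(\frac{\pi}{2\ell}\Big) + 2\cos^{2t}\Big(\frac{\pi}{\ell}\Big)\Big) \\
&= \frac{1}{2\ell}\Big(1 - \frac{2\alpha}{1 - (1-\alpha)\cos^2\big(\frac{\pi}{2\ell}\big)} + \frac{2\alpha}{1 - (1-\alpha)\cos^2\big(\frac{\pi}{\ell}\big)}\Big) \\
&\le \frac{1}{2\ell}\Big(1 - \frac{2\gamma}{\pi^2/4 + \gamma} + \frac{2\gamma}{\pi^2 + \gamma} + O\Big(\frac{1}{\ell^2}\Big)\Big).
\end{aligned}$$

We remark here that the last inequality is obtained using Taylor approximation. □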
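The closed-form bound obtained just before the Taylor step can be compared against a directly computed PageRank vector. The sketch below sums the series $\sum_t \alpha(1-\alpha)^t \chi_0 W^t$ until the remaining weight is negligible; the function names and the truncation tolerance are assumptions made for illustration.

```python
import math


def chain_pagerank(ell, alpha, start, tol=1e-15):
    """PageRank vector sum_t alpha (1-alpha)^t chi_start W^t on a chain of ell+1 vertices."""
    n = ell + 1
    deg = [1 if u in (0, ell) else 2 for u in range(n)]
    walk = [0.0] * n
    walk[start] = 1.0
    p = [0.0] * n
    weight = alpha
    while weight > tol:
        for u in range(n):
            p[u] += weight * walk[u]
        # One lazy step: keep half of the mass, split the other half among neighbors.
        nxt = [0.5 * walk[u] for u in range(n)]
        for u in range(n):
            for w in (u - 1, u + 1):
                if 0 <= w <= ell:
                    nxt[w] += 0.5 * walk[u] / deg[u]
        walk = nxt
        weight *= 1 - alpha
    return p


def lemma_f1_bound(ell, alpha):
    """Upper bound on pr(ell) before the Taylor approximation."""
    def g(x):
        return 2 * alpha / (1 - (1 - alpha) * math.cos(x) ** 2)
    return (1 - g(math.pi / (2 * ell)) + g(math.pi / ell)) / (2 * ell)
```

For instance, with $\ell = 10$ and $\gamma = 1$ (so $\alpha = 0.01$), the computed value `chain_pagerank(10, 0.01, 0)[10]` stays below `lemma_f1_bound(10, 0.01)`.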
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Lemma F.2. Let i be an even integer, and consider a chain of i + 1 vertices with the leftmost 
vertex indexed by and the rightmost vertex indexed by i. Let pr^^ be the PageRank vector for a 
random walk starting at vertex with teleport probability a = ^ for some constant 7. Then, 



> 



1 



^''^°l2;^n^ vr2+7 



1 



27 



o 



Proof. Recall from the proof of Lemma F.l that for t > we have 

XoW' = ^(vo + 2^{XkYvk] . 



k=l 



Now it is easy to compute the exact probability of reaching the middle vertex ^ ■ 



fc=l 



U-^E 



fc=i 



n, 1 'iTk\ /irk 



I 



E 

9=1 



i :.2E-''(5)(-i)' >Ui-2 



,2t 



;cos 



vr 



\2e. J 

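At last, we translate this language into the PageRank vector $\mathrm{pr}_{\chi_0}$ and obtain 

\begin{align*}
\mathrm{pr}_{\chi_0}\Big(\frac{\ell}{2}\Big)
 &= \sum_{t=0}^{\infty} \alpha(1-\alpha)^t (\chi_0 W^t)\Big(\frac{\ell}{2}\Big)
  \ge \frac{1}{\ell} \sum_{t=0}^{\infty} \alpha(1-\alpha)^t \Big( 1 - 2\cos^{2t}\Big(\frac{\pi}{\ell}\Big) \Big) \\
 &= \frac{1}{\ell}\Big( 1 - \frac{2\alpha}{1-(1-\alpha)\cos^2(\frac{\pi}{\ell})} \Big)
  \ge \frac{1}{\ell}\Big( 1 - \frac{2\gamma}{\pi^2+\gamma} - O\Big(\frac{1}{\ell^2}\Big) \Big).
\end{align*}
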
We remark here that the last inequality is obtained using Taylor approximation. 

Lemma F.3. Let $\ell$ be an even integer, and consider a chain of $\ell + 1$ vertices with the leftmost 
vertex indexed by $0$ and the rightmost vertex indexed by $\ell$. Let $\mathrm{pr}_{\chi_{\ell/2}}$ be the PageRank vector for a 
random walk starting at the middle vertex $\ell/2$ with teleport probability $\alpha = \gamma/\ell^2$ for some constant $\gamma$. 
Then, 

\[ \mathrm{pr}_{\chi_{\ell/2}}\Big(\frac{\ell}{2}\Big) \le \frac{1}{\ell}\Big( 1 + \sqrt{\gamma} + O\Big(\frac{1}{\ell}\Big) \Big). \]



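Proof. Following the notion of $\lambda_k$ and $v_k$ in (F.1), we expand our starting probability vector $\chi_{\ell/2}$ 
under this orthonormal basis: 

\[ \chi_{\ell/2} = (0, \dots, 0, 1, 0, \dots, 0) = \frac{1}{2\ell}\Big( v_0 + 2\sum_{q=1}^{\ell/2-1} (-1)^q v_{2q} + (-1)^{\ell/2} v_\ell \Big). \]

Then similar to the proof of Lemma F.1 we have that for all $t > 0$, 

\[ \chi_{\ell/2} W^t = \frac{1}{2\ell}\Big( v_0 + 2\sum_{q=1}^{\ell/2-1} (-1)^q (\lambda_{2q})^t v_{2q} \Big). \]

Now it is easy to compute the exact probability of reaching the middle vertex $\ell/2$: 

\begin{align*}
(\chi_{\ell/2} W^t)\Big(\frac{\ell}{2}\Big)
 &= \frac{1}{2\ell}\Big( v_0\Big(\frac{\ell}{2}\Big) + 2\sum_{q=1}^{\ell/2-1} (-1)^q (\lambda_{2q})^t v_{2q}\Big(\frac{\ell}{2}\Big) \Big)
  = \frac{1}{\ell}\Big( 1 + 2\sum_{q=1}^{\ell/2-1} (-1)^q \cos^{2t}\Big(\frac{\pi q}{\ell}\Big) \cos(\pi q) \Big) \\
 &= \frac{1}{2^{2t}} \sum_{k=-\lfloor t/\ell \rfloor}^{\lfloor t/\ell \rfloor} \binom{2t}{t + k\ell}.
\end{align*}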

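Notice that in the last equality we have used a recent result on power sum of cosines that can be 
found in Theorem 1 of [Mer12]. Next we perform some classical tricks on binomial coefficients: 

\begin{align*}
\sum_{k=-\lfloor t/\ell \rfloor}^{\lfloor t/\ell \rfloor} \binom{2t}{t + k\ell}
 &= \binom{2t}{t} + 2\sum_{k=1}^{\lfloor t/\ell \rfloor} \binom{2t}{t + k\ell} \\
 &\le \binom{2t}{t} + \frac{2}{\ell} \sum_{k=1}^{\lfloor t/\ell \rfloor} \Big[ \binom{2t}{t + (k-1)\ell + 1} + \binom{2t}{t + (k-1)\ell + 2} + \cdots + \binom{2t}{t + k\ell} \Big] \\
 &\le \binom{2t}{t} + \frac{1}{\ell} \sum_{q=0}^{2t} \binom{2t}{q}
  \le \frac{2^{2t}}{\sqrt{\pi t}} + \frac{2^{2t}}{\ell},
\end{align*}

and in the last inequality we have used a famous upper bound on the central binomial coefficient 
that says $\binom{2t}{p} \le \frac{2^{2t}}{\sqrt{\pi t}}$ for any integer $t \ge 1$ and $p \in \{0, 1, \dots, 2t\}$. 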

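At last, we translate this language into the PageRank vector $\mathrm{pr}_{\chi_{\ell/2}}$ and obtain 

\begin{align*}
\mathrm{pr}_{\chi_{\ell/2}}\Big(\frac{\ell}{2}\Big)
 &= \sum_{t=0}^{\infty} \alpha(1-\alpha)^t (\chi_{\ell/2} W^t)\Big(\frac{\ell}{2}\Big)
  \le \alpha + \sum_{t=1}^{\infty} \alpha(1-\alpha)^t \Big( \frac{1}{\ell} + \frac{1}{\sqrt{\pi t}} \Big) \\
 &\le \alpha + \frac{1}{\ell} + \frac{\alpha}{\sqrt{-\log(1-\alpha)}}
  \le \frac{1}{\ell}\Big( 1 + \sqrt{\gamma} + O\Big(\frac{1}{\ell}\Big) \Big).
\end{align*}

We remark here that the last inequality is obtained using Taylor approximation. □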
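The binomial form of the return probability to the middle vertex, together with the bound $1/\sqrt{\pi t} + 1/\ell$ derived from it, can be checked exactly on small chains. This is a minimal sketch, assuming the lazy walk on a simple path of $\ell + 1$ vertices and reading the closed form as $2^{-2t} \sum_{k} \binom{2t}{t + k\ell}$ with $k$ ranging over $-\lfloor t/\ell \rfloor, \dots, \lfloor t/\ell \rfloor$; the helper names are hypothetical.

```python
from fractions import Fraction
from math import comb, pi, sqrt

def lazy_step(p):
    """One lazy random walk step on a path, in exact rational arithmetic."""
    n = len(p)
    q = [mass / 2 for mass in p]              # stay put with probability 1/2
    for u, mass in enumerate(p):
        nbrs = [v for v in (u - 1, u + 1) if 0 <= v < n]
        for v in nbrs:
            q[v] += mass / (2 * len(nbrs))
    return q

def return_prob_middle(ell, t):
    """Probability that a t-step lazy walk started at ell/2 is at ell/2 again."""
    p = [Fraction(0)] * (ell + 1)
    p[ell // 2] = Fraction(1)
    for _ in range(t):
        p = lazy_step(p)
    return p[ell // 2]

def binomial_form(ell, t):
    """Closed form 2^(-2t) * sum_k C(2t, t + k*ell), for t >= 1."""
    m = t // ell
    return Fraction(sum(comb(2 * t, t + k * ell) for k in range(-m, m + 1)), 4 ** t)

if __name__ == "__main__":
    for ell in (2, 4, 6):
        for t in range(1, 13):
            exact = return_prob_middle(ell, t)
            assert exact == binomial_form(ell, t)
            assert float(exact) <= 1 / sqrt(pi * t) + 1 / ell
    print("binomial form and upper bound agree with simulation")
```

Using `Fraction` keeps the comparison exact, so any mismatch would point to a wrong formula and not to rounding.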

Lemma F.4. Consider an infinite chain with one special vertex called the origin. Note that the 
chain is infinite both to the left and to the right of the origin. Now we study the PageRank random 
walk on this infinite chain that starts from the origin with teleport probability $\alpha = \gamma/\ell^2$, and denote 
by $\mathrm{pr}_{\chi_0}(0)$ the probability of reaching the origin. Then, 

\[ \mathrm{pr}_{\chi_0}(0) \ge \frac{\sqrt{\pi\gamma}}{2\ell} - O\Big(\frac{1}{\ell^2}\Big). \]



Proof. As before we begin with the analysis of a lazy random walk of a fixed length $t$, and will 
translate it into the language of a PageRank random walk in the end. Suppose that among the $t$ actual 
steps, there are $t_1 \le t$ of them in which the random walk moves either to the 
left or to the right, while in the remaining $t - t_1$ of them the random walk stays. This happens 
with probability $\binom{t}{t_1} 2^{-t}$. When $t_1$ is fixed, to reach the origin it must be the case that among the $t_1$ 
left-or-right moves, exactly $t_1/2$ of them are left moves, and the other half are right moves. This 
happens with probability $\binom{t_1}{t_1/2} 2^{-t_1}$ (and cannot happen when $t_1$ is odd). In sum, the probability to reach the origin in a $t$-step lazy 
random walk is: 

\begin{align*}
\sum_{t_1=0}^{t} \binom{t}{t_1} 2^{-t} \binom{t_1}{t_1/2} 2^{-t_1}
 &= \sum_{y=0}^{t/2} \binom{t}{2y} \binom{2y}{y} 2^{-t-2y}
  = \frac{1}{t!} \cdot \frac{(2t-1)!!}{2^t}
  = \frac{1}{t!} \cdot \frac{(2t)!}{t!\, 2^{2t}} \\
 &= \frac{1}{2^{2t}} \binom{2t}{t}
  \ge \frac{1}{\sqrt{4t}}.
\end{align*}



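Here in the last inequality we have used the famous lower bound on the central binomial coefficient 
that says $\binom{2t}{t} \ge \frac{2^{2t}}{\sqrt{4t}}$ for $t \ge 1$. At last, we translate this into the language of a PageRank random 
walk: 

\begin{align*}
\mathrm{pr}_{\chi_0}(0)
 &\ge \alpha + \sum_{t=1}^{\infty} \alpha(1-\alpha)^t \frac{1}{\sqrt{4t}}
  \ge \alpha + \int_{t=1}^{\infty} \alpha(1-\alpha)^t \frac{1}{\sqrt{4t}} \, dt \\
 &= \alpha + \frac{\alpha \sqrt{\pi} \big( 1 - \mathrm{erf}\big( \sqrt{-\log(1-\alpha)} \big) \big)}{2\sqrt{-\log(1-\alpha)}}
  \ge \frac{\sqrt{\pi\gamma}}{2\ell} - O\Big(\frac{1}{\ell^2}\Big).
\end{align*}

Here in the last inequality we have used the Taylor approximation for the Gaussian error function 
erf. □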
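The counting argument for the infinite chain can be verified exactly for small $t$. The sketch below, with hypothetical helper names, compares the sum over the number of non-lazy moves with the closed form $\binom{2t}{t}/2^{2t}$ and with the lower bound $1/\sqrt{4t}$; terms with an odd number of moves are omitted because they cannot return to the origin.

```python
from fractions import Fraction
from math import comb, sqrt

def return_prob_by_moves(t):
    """Sum over 2y left-or-right moves among t lazy steps, y of them to the left."""
    return sum(
        Fraction(comb(t, 2 * y) * comb(2 * y, y), 2 ** (t + 2 * y))
        for y in range(t // 2 + 1)
    )

def return_prob_closed_form(t):
    """Closed form C(2t, t) / 2^(2t) for the same probability."""
    return Fraction(comb(2 * t, t), 4 ** t)

if __name__ == "__main__":
    for t in range(1, 41):
        exact = return_prob_by_moves(t)
        assert exact == return_prob_closed_form(t)
        # Lower bound on the central binomial coefficient used in the proof.
        assert float(exact) >= 1 / sqrt(4 * t) - 1e-15
    print("identity and lower bound hold for t = 1..40")
```

At $t = 1$ the lower bound is attained with equality (both sides equal $1/2$), which is why the check allows a tiny floating-point tolerance.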

F.2 Proof of Lemma 5.1 

We are now ready to show the proof for Lemma 5.1. 

Lemma 5.1. For any $\gamma \in (0, 4]$ and letting $\alpha = \gamma/\ell^2$, there exists some constant $c_0$ such that when 
studying the PageRank vector $\mathrm{pr}_a$ starting from vertex $a$ in Figure 1, it satisfies that 
$\frac{\mathrm{pr}_a(d)}{\deg(d)} > \frac{\mathrm{pr}_a(c)}{\deg(c)}$. 



We divide the proof into four steps. In the first step we provide an upper bound on $\frac{\mathrm{pr}_a(c)}{\deg(c)}$ for 
vertex $c$, and in the second step we provide a lower bound on $\frac{\mathrm{pr}_a(b)}{\deg(b)}$ for vertex $b$. Both these steps 
require a careful study on a finite chain (and in fact the top chain in Figure 1), which we have 
already done in Appendix F.1. They together will imply that 

\[ \frac{\mathrm{pr}_a(b)}{\deg(b)} \ge (1 + \Omega(1)) \cdot \frac{\mathrm{pr}_a(c)}{\deg(c)}. \tag{F.2} \]

In the third step, we show that 

\[ \frac{\mathrm{pr}_a(d)}{\deg(d)} \ge (1 - O(1)) \cdot \frac{\mathrm{pr}_a(b)}{\deg(b)}, \tag{F.3} \]

that is, the (normalized) probability for reaching $d$ must be roughly as large as that for $b$. The reason 
is as follows: suppose towards contradiction that $\frac{\mathrm{pr}_a(d)}{\deg(d)}$ is much smaller than $\frac{\mathrm{pr}_a(b)}{\deg(b)}$; then there 
must be a large amount of probability mass moving from $b$ to $d$ due to the nature of the PageRank 
random walk, while a large fraction of it should remain at vertex $d$ due to the chain at the 
bottom, giving a contradiction to $\frac{\mathrm{pr}_a(d)}{\deg(d)}$ being small. 

And in the last step, we choose the constants very carefully to deduce $\frac{\mathrm{pr}_a(d)}{\deg(d)} > \frac{\mathrm{pr}_a(c)}{\deg(c)}$ out of 
(F.2) and (F.3). 

Step 1: upper bounding $\mathrm{pr}_a(c)/\deg(c)$. In the first step we upper bound the probability of 
reaching vertex $c$. Since removing the edges between $b$ and $d$ will disconnect the graph and thus 
only increase such probability, it suffices for us to consider just the top chain, which is equivalent 
to the PageRank random walk on a finite chain of length $\ell + 1$ studied in Lemma F.1. In our 
language, taking into account the multi-edges, we have that 

\begin{align*}
\frac{\mathrm{pr}_a(c)}{\deg(c)}
 &\le \frac{1}{n/\ell} \cdot \frac{1}{2\ell}\Big( 1 - \frac{2\gamma}{\pi^2/4+\gamma} + \frac{2\gamma}{\pi^2+\gamma} + O\Big(\frac{1}{\ell^2}\Big) \Big) \\
 &= \frac{1}{2n}\Big( 1 - \frac{2\gamma}{\pi^2/4+\gamma} + \frac{2\gamma}{\pi^2+\gamma} + O\Big(\frac{1}{\ell^2}\Big) \Big). \tag{F.4}
\end{align*}

Step 2: lower bounding $\mathrm{pr}_a(b)/\deg(b)$. In this step we ask for help from a variant of 
Lemma 3.1. Letting $\widetilde{\mathrm{pr}}_s$ be the PageRank vector on the induced subgraph $G[A]$ starting from $s$ with 
teleport probability $\alpha$, then Lemma 3.1 (and its actual proof) implies that $\mathrm{pr}_a(b) \ge \widetilde{\mathrm{pr}}_a(b) - \widetilde{\mathrm{pr}}_l(b)$ 
where I is a vector that is only non-zero at the boundary vertex b, and in addition, ||/||i = l{b) < — 
since a G ^4^ is a good starting vertex. We can rewrite this as 

pra{b) > prjb) pr^ib) . 

a 

Next we use Lemma F.2 and Lemma F.3 to deduce that: 

At last, we normalize this probability by its degree deg(6) = 2n/£ + ^n and get: 



2n \ TT^ -|- 7 



>T^(i-rr^-0(^^')] ■ (F-5) 



Step 3: lower bounding $\mathrm{pr}_a(d)/\deg(d)$. Since we have already shown a good lower bound 
on $\mathrm{pr}_a(b)/\deg(b)$ in the previous step, one may naturally guess that a similar lower bound should 
apply to vertex $d$ as well because $b$ and $d$ are neighbors. This is not true in general: for instance, 
if $d$ were connected to a very large complete graph then all probability mass that reached $d$ would 
be badly diluted. However, with our careful choice of the bottom chain, we will show that this is 
true in our case. 

Lemma F.5. Let p* '^ g[|, then e^ther g|f > (1 - c^)p* or g|| > ^p*(l - 0{\)). 

Proof. Throughout the proof we assume that j^°/ i < (1 — ci)p* because otherwise we are done. 

Therefore, we only need to show that ^^"(2 > ^^p*{l — 0(|)) is true under this assumption. 

We first show a lower bound on the amount of net probability that will leak from $A$ during the 
given PageRank random walk, i.e., $\mathrm{NetLeakage} = \sum_{u \notin A} \mathrm{pr}_a(u)$. Loosely speaking, this net 
probability is the amount of probability that will leak from $A$, minus the amount of probability 
that will come back to $A$. 

We introduce some notation first. L et p^^' = Xa W^ be the lazy random walk vector after t 
steps, and using the similar notation as Lemma 4.1, we let p^^'{b,d) = ^ — W be the amount of 



dcg(b) 



probability mass sent from btod per edge at time step i to t -|- 1, and similarly p^^' {d, b) = % — k4 • If 
the PageRank random walk runs for a total of $t$ steps (which happens with probability $\alpha(1-\alpha)^t$), 
then the total amount of net leakage becomes $\sum_{i=0}^{t-1} \big( p^{(i)}(b,d) - p^{(i)}(d,b) \big)$ multiplied by the number 
of edges between $b$ and $d$. This gives another 
way to compute the total amount of net leakage of a PageRank random walk: 



t-i 



NetLeakage = ^a(l-a)*^(p(^)(^f^)-P^^^(c^,^)) • ^n = ^ (p«(6,d) -pW((i, 6)) ■ ^n ^ a(l - a)* 

t=0 i=0 i=0 t=i+l 

oo ^ oo 

= Yl (P^'^ (^' ^) " P^'^ {d,b))-^n-{l- a)^+i = ^^^ ^ a(l - a)^ (p(^) (6, d) - p^''> {d, b)) ■ ^n 

i=0 i=0 

l-a(pra{b) pra{d)\ I- a 



Q \deg(6) deg((i)/ a 
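The key manipulation in the NetLeakage computation above is an exchange of the order of summation: weighting the cumulative flow of a walk of length $t$ by $\alpha(1-\alpha)^t$ equals $\frac{1-\alpha}{\alpha}$ times the PageRank-weighted sum of the per-step flows. The sketch below checks this identity numerically on an arbitrary, made-up flow sequence standing in for the per-step net flow from $b$ to $d$; the function names and the sample numbers are illustrative assumptions only.

```python
def leakage_by_walk_length(flow, alpha, horizon):
    """sum_t alpha (1 - alpha)^t * sum_{i < t} flow[i], truncated at `horizon` terms."""
    total, prefix = 0.0, 0.0
    for t in range(horizon):
        total += alpha * (1 - alpha) ** t * prefix   # prefix holds sum_{i < t} flow[i]
        if t < len(flow):
            prefix += flow[t]
    return total

def leakage_by_step(flow, alpha):
    """(1 - alpha) / alpha * sum_i alpha (1 - alpha)^i * flow[i]."""
    weighted = sum(alpha * (1 - alpha) ** i * f for i, f in enumerate(flow))
    return (1 - alpha) / alpha * weighted

if __name__ == "__main__":
    flow = [0.3, -0.1, 0.25, 0.05]   # hypothetical per-step net flows
    alpha = 0.1
    print(leakage_by_walk_length(flow, alpha, horizon=2000))
    print(leakage_by_step(flow, alpha))
```

The two printed values should agree up to the truncation error, which decays like $(1-\alpha)^{\text{horizon}}$.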

Now we have a decent lower bound on the amount of net leakage, and we want to further lower 
bound $\mathrm{pr}_a(d)$ using this NetLeakage quantity. We achieve this by studying an auxiliary "random 
walk" procedure $q^{(t)}$, where $q^{(0)} = p^{(0)} = \chi_a$, but 

{0, if u ^ b,u ^ d; 

p^\d,b)-^n, iiu = b] 
-p^^\b,d) -^n, i{u = d. 

It is not hard to prove by induction that for all $t \ge 0$, it satisfies $q^{(t)}(u) = p^{(t)}(u)$ for $u \in A$ and 
$q^{(t)}(u) = 0$ for $u \notin A$. Then we have that: 

oo 

t=o 

is precisely the vector that is zero everywhere in A and equal to pra everywhere in V \ A. We 
further notice that 

oo oo /*~^ \ 

A = ^ a(l - ay (gW - pW) =Y.a{l- a)* ^ 5«W^*— ^ 

t=0 t=0 \j=0 / 

oo oo oo / oo \ 

= ^ ^ a(l - a)'^+^+i<5»^^ = E "(1 - ")' E(l - ^y^'^^'^ W' . 



k=0 i=0 k=0 \i=0 



Therefore, as long as we define 6 = Yl'i^oi^ ~ o^y^^^^''' = ^^Yl'i^o'^i^ ~ ct)**^ i ^^ can write 
A = prs also as a PageRank vector. We highlight here that (5 is a vector that is non-zero only at 
vertex b and d (and in fact 6{d) > and 6{b) < 0), such that 5{d) + 5{b) = NetLeakage according 
to the first equality in |(F.6) 



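Now we are ready to lower bound $\mathrm{pr}_a(d)$. Using the linearity of PageRank vectors we have 

\[ \mathrm{pr}_a(d) = \Delta(d) = \mathrm{pr}_\delta(d) = \mathrm{pr}_{(\delta(d)\chi_d + \delta(b)\chi_b)}(d) = \delta(d) \cdot \mathrm{pr}_d(d) + \delta(b) \cdot \mathrm{pr}_b(d) \ge (\delta(d) + \delta(b)) \cdot \mathrm{pr}_d(d), \]

where in the last inequality we have used $\mathrm{pr}_b(d) \le \mathrm{pr}_d(d)$, which is true by monotonicity. Then we 
continue 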

/ 1 — a \ f \ — a. \ f 7r7 / 1 

praid) > (NetLeakage) •prrf(d) > (——cip*^n\ ■ pr^id) > l^^cip*^nj " ( ^ -0(^^ 



(Footnote on the induction claim: this is obvious when $t = 0$. For $q^{(t+1)}$, we compute $p^{(t+1)} = p^{(t)} W$ and $q^{(t+1)} = q^{(t)} W + \delta^{(t)}$. Based on the 
inductive assumption that the claim holds for $q^{(t)}$, it is automatically true that for $u \in A \setminus \{b\}$, $p^{(t+1)}(u) = q^{(t+1)}(u)$, 
and for $u \in V \setminus (A \cup \{d\})$ we have $q^{(t+1)}(u) = 0$. For $u = b$ or $u = d$, one can carefully check that $\delta^{(t)}$ is introduced 
precisely to make $q^{(t+1)}(b) = p^{(t+1)}(b)$ and $q^{(t+1)}(d) = 0$, so the claim holds.) 

using (F.6) in the second inequality and Lemma F.4 in the last inequality, so we conclude that 



praid) > 'fp*^n{i - 0(1)) and then f^ > ^p*(l - 0(|)). 



dcg(a!) 



D 



Step 4: putting it all together. We now define (using the fact that $\gamma > 0$ and $\gamma \le 4$) constant 
$c_2$ to satisfy 

\[ 1 - c_2 = \frac{1 - \frac{2\gamma}{\pi^2/4+\gamma} + \frac{2\gamma}{\pi^2+\gamma}}{1 - \frac{2\gamma}{\pi^2+\gamma}} < 1. \]
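To see why the restriction $\gamma \in (0, 4]$ yields a valid constant, one can evaluate $c_2$ directly. The sketch below assumes the definition reads $1 - c_2 = \big(1 - \frac{2\gamma}{\pi^2/4+\gamma} + \frac{2\gamma}{\pi^2+\gamma}\big) / \big(1 - \frac{2\gamma}{\pi^2+\gamma}\big)$, i.e. the leading terms of the Step 1 upper bound divided by those of the Step 2 lower bound; the function name `c2` is a hypothetical helper.

```python
from math import pi

def c2(gamma):
    """Constant c_2 under the assumed reading of its definition in Step 4."""
    if not 0 < gamma <= 4:
        raise ValueError("gamma is assumed to lie in (0, 4]")
    numerator = 1 - 2 * gamma / (pi ** 2 / 4 + gamma) + 2 * gamma / (pi ** 2 + gamma)
    denominator = 1 - 2 * gamma / (pi ** 2 + gamma)
    return 1 - numerator / denominator

if __name__ == "__main__":
    for gamma in (0.1, 1.0, 2.0, 4.0):
        print(gamma, c2(gamma))
```

Under this reading, a short calculation shows that the ratio is below $1$ exactly when $\gamma < \pi^2/2 \approx 4.93$, which is consistent with the assumption $\gamma \le 4$ in Lemma 5.1.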



This constant is asymptotically the ratio between 
satisfies that (using the fact that "^i^ = o(l)) 



(F.4) and (F.5) , so once we let p* 



prg (b) 
deg(b) 



it 



prgjc) 
deg(c) 



<(l-C2)/(l + o(l)) . 



Next, if we choose ci = ^ 



and Co = 

praid) 



m 



Lemma F.5 



C2 



, ., > min i 1 -A — O 

deg(d) - 1 2 ' 



this gives 
1 



P 



It is now clear from the above two inequalities that in the asymptotic case, i.e., when $n, \ell$ are 
sufficiently large, we always have $\frac{\mathrm{pr}_a(d)}{\deg(d)} > \frac{\mathrm{pr}_a(c)}{\deg(c)}$. This finishes the proof of 
Lemma 5.1. 



G Related work 



Most relevant to our work are the ones on local algorithms for clustering. On the theoretical side, 
after the first such result [ST04, ST08], [ACL06] simply compute a PageRank random walk vector 
and then show that one of its sweep cuts satisfies cut conductance $O(\sqrt{\Phi \log n})$. The computation 
of this PageRank vector is deterministic and is essentially the algorithm we adopt in this paper. 
[AP09, GT12] use the theory of evolving sets from [MP03]. They study a stochastic volume-biased 
evolving set process that is similar to a random walk. This leads to a better (but probabilistic) 
running time, but essentially the same cut conductance guarantee. 

The problem of conductance minimization is UGC-hard to approximate within any constant 
factor [CKK+06]. On the positive side, spectral partitioning algorithms output a solution with 
conductance $O(\sqrt{\Phi})$, where this idea traces back to [Alo86] and [SJ89]; Leighton and Rao [LR99] 
provide a first $O(\log n)$ approximation; and Arora, Rao and Vazirani [ARV09] provide an $O(\sqrt{\log n})$ 
approximation. Those results, along with recent improvements on the running time by for instance 
[AHK10, AK07, She09], are all global algorithms: their time complexities depend at least linearly 
on the size of $G$. There is also work in machine learning to make such global algorithms 
practical, including the seminal work of [LC10] for spectral partitioning. 

Less relevant to our work is supervised learning for finding clusters, where there exist algorithms 
that have a sub-linear running time in terms of the size of the training set [ZCZ+09, SSS08]. 

On the empirical side, random-walk-based graph clustering algorithms have been widely used in 
practice [GS12, GLMY11, ACE+13, AGM12] as they can be implemented in a distributed manner 
for very big graphs using map-reduce or similar distributed graph mining algorithms [LLDM09, 
GLMY11, GS12, AGM12]. Such local algorithms have been applied for (overlapping) clustering of 
big graphs for distributed computation [AGM12], or community detection on huge YouTube video 
graphs [GLMY11]. 

More recently, [WLS+12] studied a variant of the PageRank random walk and performed supportive 
experiments on it. Their experiments confirmed the first two properties in our Theorem 1, 
but their theoretical results are not strong enough to confirm them. This is because there is no 
well-connectedness assumption in their paper, so they are forced to study random walks that start from 
a random vertex selected in $A$, rather than a fixed one like ours. In addition, they have not argued 
about the cut conductance (like our third property in Theorem 1) of the set they output. 



Clustering is an important technique for community detection, and indeed local clustering 
algorithms have been widely applied there; see for instance [AL06]. Sometimes researchers care 
about finding all communities, i.e., clusters, in the entire graph, and this can be done by repeatedly 
applying local clustering algorithms. However, if the ultimate goal is to find all clusters, global 
algorithms perform better, at least in terms of minimizing conductance [LLDM09, GLMY11, 
GS12, AGM12, LLM10]. 